I modified the Tcl 8.4.20 source in the following files in order to measure the run time of a Tcl script:
A small utility to record a timestamp:
void save()
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC_RAW, &t);
    save_to_an_array(&t);    /* append the timestamp to a global array */
}
tclMain.c: in Tcl_Main(), record the time just before calling Tcl_FSEvalFile().
tclBasic.c: in Tcl_EvalEx(), record the time at the start; there are multiple exits, so record the time at each exit.
tclMain.c: before Tcl_Main() returns, dump out all the recordings.
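For reference, a minimal sketch of what the recording array and the dump step could look like (the helper names save_to_an_array and dump_times, the array size, and printing to stderr are illustrative choices, not the actual patch):

#include <stdio.h>
#include <time.h>

#define MAX_RECORDS 4096

/* Storage for the timestamps collected by save(). */
static struct timespec records[MAX_RECORDS];
static int record_count = 0;

static void save_to_an_array(const struct timespec *t)
{
    if (record_count < MAX_RECORDS) {
        records[record_count++] = *t;
    }
}

/* Called just before Tcl_Main() exits: print each record as milliseconds
 * elapsed since the first one. */
static void dump_times(void)
{
    int i;
    for (i = 1; i < record_count; i++) {
        double ms = (records[i].tv_sec  - records[0].tv_sec)  * 1000.0
                  + (records[i].tv_nsec - records[0].tv_nsec) / 1.0e6;
        fprintf(stderr, "record %d: %.2f ms\n", i, ms);
    }
}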
I build the Tcl source as usual, and the resulting tclsh8.4 executable now contains my built-in instrumentation to record the script run time and dump the timings at the end.
I use a one-liner script: puts hello
To my surprise, the run time varies greatly. Here are the times from consecutive runs:
run1 - 232.00ms
run2 - 7886.00ms
run3 - 6973.00ms
run4 - 5749.00ms
run5 - 224.00ms
run6 - 6820.00ms
run7 - 6074.00ms
run8 - 221.00ms
Maybe the bytecode path has better consistency? So I added more probes to Tcl_EvalObjEx() and TclExecuteByteCode(). Here is the new script:
proc p {} {
    puts hello
}
p
But it is not consistent either:
run1 - 226.00ms
run2 - 7877.00ms
run3 - 6964.00ms
run4 - 5740.00ms
run5 - 218.00ms
run6 - 6809.00ms
run7 - 6064.00ms
run8 - 216.00ms
Do you see what might be the problem?
[UPDATE]
Maybe puts is a bad choice, since it is an I/O command that is affected by many system-level factors, so I changed the script to some random commands:
set a 100
set b 200
append a 300
array set arr1 {one 1 two 2}
It is definitely better:
run1 - 9.00ms
run2 - 9.00ms
run3 - 19.00ms
run4 - 9.00ms
run5 - 9.00ms
run6 - 9.00ms
run7 - 9.00ms
run8 - 9.00ms
run9 - 9.00ms
run10 - 9.00ms
But again, where does that 19ms in run3 come from?
The problem with using wall-clock timing is that it is very sensitive to whatever else is going on on your system. The OS can simply decide to suspend your process at any moment on the grounds of having other “more important” work for the CPU to do. It doesn't know that you're trying to do performance measurements.
The usual way of fixing this is to do many runs of the timing script and take the minimum of the measured timings, bearing in mind that the cost of doing the timing at all is non-zero and can have an effect on the output.
The time command in standard Tcl is intended for this sort of thing. Here's an example of use:
puts [time {
    set a 100
    set b 200
    append a 300
    array set arr1 {one 1 two 2}
} 100]
This runs the code fragment from before 100 times and prints the average execution time. (In my performance-intensive tests, I'll use a whole bunch of stabilisation code so that I get reasonable information out of even microbenchmarks, but all they're really doing is guessing a good value for the iterations and printing the minimum of a bunch of samples. Also, microbenchmarks might well end up with iteration counts in the hundreds of thousands or millions.)
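For what it's worth, here is a rough sketch of that sample-and-take-the-minimum idea (the 10 samples and 100 iterations per sample are arbitrary numbers, not a recommendation):

set best {}
for {set i 0} {$i < 10} {incr i} {
    # [time {...} 100] returns something like "12 microseconds per iteration";
    # keep just the number.
    set t [lindex [time {
        set a 100
        set b 200
        append a 300
        array set arr1 {one 1 two 2}
    } 100] 0]
    if {$best eq "" || $t < $best} {
        set best $t
    }
}
puts "best: $best microseconds per iteration"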
Be aware that you're using a version of Tcl that has been end-of-lifed. 8.5 is the current LTS version (i.e., it mostly only receives security fixes — if any; we don't have many vulns — and updates to support evolving OS APIs), and 8.6 is for new work. (8.7 and 9.0 are under development, but still pre-alpha.)
Related
I am using Ray 1.3.0 (for RLlib) in combination with SUMO version 1.9.2 for the simulation of a multi-agent scenario. I have configured RLlib to use a single PPO network that is commonly updated/used by all N agents. My evaluation settings look like this:
# === Evaluation Settings ===
# Evaluate with every `evaluation_interval` training iterations.
# The evaluation stats will be reported under the "evaluation" metric key.
# Note that evaluation is currently not parallelized, and that for Ape-X
# metrics are already only reported for the lowest epsilon workers.
"evaluation_interval": 20,
# Number of episodes to run per evaluation period. If using multiple
# evaluation workers, we will run at least this many episodes total.
"evaluation_num_episodes": 10,
# Whether to run evaluation in parallel to a Trainer.train() call
# using threading. Default=False.
# E.g. evaluation_interval=2 -> For every other training iteration,
# the Trainer.train() and Trainer.evaluate() calls run in parallel.
# Note: This is experimental. Possible pitfalls could be race conditions
# for weight synching at the beginning of the evaluation loop.
"evaluation_parallel_to_training": False,
# Internal flag that is set to True for evaluation workers.
"in_evaluation": True,
# Typical usage is to pass extra args to evaluation env creator
# and to disable exploration by computing deterministic actions.
# IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal
# policy, even if this is a stochastic one. Setting "explore=False" here
# will result in the evaluation workers not using this optimal policy!
"evaluation_config": {
# Example: overriding env_config, exploration, etc:
"lr": 0, # To prevent any kind of learning during evaluation
"explore": True # As required by PPO (read IMPORTANT NOTE above)
},
# Number of parallel workers to use for evaluation. Note that this is set
# to zero by default, which means evaluation will be run in the trainer
# process (only if evaluation_interval is not None). If you increase this,
# it will increase the Ray resource usage of the trainer since evaluation
# workers are created separately from rollout workers (used to sample data
# for training).
"evaluation_num_workers": 1,
# Customize the evaluation method. This must be a function of signature
# (trainer: Trainer, eval_workers: WorkerSet) -> metrics: dict. See the
# Trainer.evaluate() method to see the default implementation. The
# trainer guarantees all eval workers have the latest policy state before
# this function is called.
"custom_eval_function": None,
What happens is that every 20 iterations (each iteration collecting "X" training samples), there is an evaluation run of at least 10 episodes. The rewards received by all N agents are summed over these episodes, and that total is set as the reward sum for that particular evaluation run. Over time, I notice that the reward sums follow a pattern that repeats over the same interval of evaluation runs continuously, and the learning goes nowhere.
UPDATE (23/06/2021)
Unfortunately, I did not have TensorBoard activated for that particular run, but from the mean rewards that were collected during evaluations (which happen every 20 iterations) of 10 episodes each, it is clear that there is a repeating pattern, as shown in the annotated plot below.
The 20 agents in the scenario should be learning to avoid colliding, but instead they somehow stagnate at a certain policy and end up showing the exact same reward sequence during evaluation.
Is this a characteristic of how I have configured the evaluation aspect, or should I be checking something else? I would be grateful if anyone could advise or point me in the right direction.
Thank you.
Step 1: I noticed that when I stopped the run at some point for any reason and then restarted it from the saved checkpoint, most graphs on TensorBoard (including rewards) charted out the line in EXACTLY the same fashion all over again, which made it look like the sequence was repeating.
Step 2: This led me to believe that there was something wrong with my checkpoints. I compared the weights in the checkpoints using a loop and voila, they were all the same! Not a single change! So either there was something wrong with the saving/restoring of checkpoints, or the weights were simply not being updated. After a bit of playing around I found the former was not the case, so it meant my weights were not being updated!
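For reference, the comparison was along these lines (a rough sketch; the checkpoint paths, the policy id "shared_ppo", and the elided training config are placeholders, not my exact setup):

import numpy as np
from ray.rllib.agents.ppo import PPOTrainer

# Hypothetical checkpoint paths and policy id -- substitute your own.
checkpoints = ["run/checkpoint_20/checkpoint-20", "run/checkpoint_40/checkpoint-40"]
policy_id = "shared_ppo"
config = {...}  # the same config dict that was used for training (not shown here)

weights = []
for path in checkpoints:
    trainer = PPOTrainer(config=config)
    trainer.restore(path)
    weights.append(trainer.get_policy(policy_id).get_weights())
    trainer.stop()

# If every array is bit-for-bit identical between checkpoints,
# the policy was never actually updated.
identical = all(np.array_equal(weights[0][k], weights[1][k]) for k in weights[0])
print("weights identical across checkpoints:", identical)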
Step 3: I sifted through my training configuration to see if something there was preventing the network from learning, and I noticed that I had set my "multiagent" configuration option "policies_to_train" to a policy that did not exist. Unfortunately, this either did not throw a warning/error, or it did and I completely missed it.
Solution step: Once I set the multiagent "policies_to_train" configuration option correctly, it started to work!
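In config terms, the fix boils down to making sure the id listed in "policies_to_train" matches a key that actually exists under "policies". A minimal sketch of a corrected multiagent block (the policy id "shared_ppo", the env name, and the observation/action spaces are illustrative placeholders, not my exact values):

from gym import spaces
from ray.rllib.agents.ppo import PPOTrainer

# Placeholder spaces for the sketch.
obs_space = spaces.Box(low=-1.0, high=1.0, shape=(10,))
act_space = spaces.Discrete(5)

config = {
    "env": "my_sumo_env",  # placeholder: a registered multi-agent env
    "multiagent": {
        # One shared PPO policy used by all N agents.
        "policies": {
            "shared_ppo": (None, obs_space, act_space, {}),
        },
        "policy_mapping_fn": lambda agent_id: "shared_ppo",
        # Must name an id that exists in "policies" above --
        # pointing this at a non-existent policy is what froze my weights.
        "policies_to_train": ["shared_ppo"],
    },
    "evaluation_interval": 20,
    "evaluation_num_episodes": 10,
}

trainer = PPOTrainer(config=config)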
Could it be that due to the multi-agent dynamics, your policy is chasing its tail? How many policies do you have? Are they competing/collaborating/neutral to each other?
Note that multi-agent training can be very unstable, and seeing these fluctuations is quite normal, as the different policies get updated and then have to face different "env" dynamics because of that (the effective env is the env plus all other policies, which appear as part of the env as well).
I am plowing through the Tcl source code and got confused by the macros NEXT_INST_F and NEXT_INST_V in tclExecute.c, specifically their cleanup parameter.
Initially I thought cleanup meant the net number of slots consumed/popped from the stack, e.g. when 3 objects are popped and 1 object is pushed, cleanup would be 2.
But I see that INST_LOAD_STK has cleanup set to 1; shouldn't it be zero, since one object is popped and one object is pushed?
I am lost reading the code of NEXT_INST_F and NEXT_INST_V; there are too many jumps.
I hope you can clarify the semantics of cleanup for me.
The NEXT_INST_F and NEXT_INST_V macros (in the implementation of Tcl's bytecode engine) clean up the state of the operand stack and push the result of the operation before going to the next instruction. The only practical difference between the two is that one is designed to be highly efficient when the number of stack locations to be cleaned up is a constant number (from a small range: 0, 1 and 2 — this is the overwhelming majority of cases), and the other is less efficient but can handle a variable number of locations to clean up or a number outside the small range. So NEXT_INST_F is basically an optimised version of NEXT_INST_V.
The place where macros are declared in tclExecute.c has this to say about them:
/*
* The new macro for ending an instruction; note that a reasonable C-optimiser
* will resolve all branches at compile time. (result) is always a constant;
* the macro NEXT_INST_F handles constant (nCleanup), NEXT_INST_V is resolved
* at runtime for variable (nCleanup).
*
* ARGUMENTS:
* pcAdjustment: how much to increment pc
* nCleanup: how many objects to remove from the stack
* resultHandling: 0 indicates no object should be pushed on the stack;
* otherwise, push objResultPtr. If (result < 0), objResultPtr already
* has the correct reference count.
*
* We use the new compile-time assertions to check that nCleanup is constant
* and within range.
*/
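Conceptually (this is a simplification to show the argument semantics, not the literal macro text), NEXT_INST_F(pcAdjustment, nCleanup, resultHandling) behaves roughly like this:

/* Schematic only -- the real macro hand-unrolls nCleanup == 0, 1, 2
 * and jumps into shared cleanup code for speed. */
pc += pcAdjustment;              /* step over this instruction            */
while (nCleanup-- > 0) {         /* drop the operands this op consumed... */
    Tcl_Obj *objPtr = POP_OBJECT();
    TclDecrRefCount(objPtr);     /* ...releasing each reference           */
}
if (resultHandling != 0) {
    PUSH_OBJECT(objResultPtr);   /* then push the single result object    */
}
continue;                        /* back to the instruction dispatch loop */

That is why nCleanup counts the operands an instruction consumes rather than the net change in stack depth: the result push is handled separately through resultHandling.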
However, instructions can also directly manipulate the stack. This complicates things quite a lot. Most don't, but that's not the same as all. If you were to view this particular load of code as one enormous pile of special cases, you'd not be very wrong.
INST_LOAD_STK (a.k.a. loadStk if you're reading disassembly of some Tcl code) is an operation that will pop an unparsed variable name from the stack and push the value read from the variable with that name. (Or an error will be thrown.) It is totally expected to pop one value and push another (from objResultPtr), since we are popping (and decrementing the reference count of) the variable name value, and pushing (and incrementing the reference count of) a different value that was read from the variable.
The code to read and write variables is among the most twisty in the bytecode engine. Far more goto than is good for your health.
I am working on Expect scripting. I want to understand the difference between sleep and after. Any example would help me understand.
There are three different entities:
TclX's sleep
The sleep command from the TclX package. According to the documentation, it takes a decimal argument, taken to be the number of seconds to sleep. However, the fractional part is truncated, so sleep 2.5 will sleep for two seconds.
Expect's sleep
The sleep command from the Expect package. This is similar to its counterpart from the TclX package; however, sleep 2.5 means sleeping for 2.5 seconds, with no truncation.
After
Finally, there is the built-in after, which is a totally different beast. The after command takes its first argument as the number of milliseconds to sleep. This is the "synchronous" mode Jerry refers to. after can also take a second argument, a script; in that case, after returns a token right away, and once the specified time has elapsed the script is executed. With the token, you can cancel the scheduled script.
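A quick illustration of both modes (the one-second delay and the message are arbitrary):

# Synchronous: block the script for one second.
after 1000

# Asynchronous: schedule a script and get a token back immediately.
set token [after 1000 {puts "fired after one second"}]
# after cancel $token   ;# would cancel it before it fires
vwait forever           ;# the scheduled script only runs if the event loop is entered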
I hope this helps.
sleep is similar to after "synchronous" mode, with the difference being that (emphasis mine):
Tcl's built-in after command uses delay units of milliseconds whereas the TclX/Expect command works with seconds (i.e., a factor of 1000 different). Be careful when converting.[1]
My try at a short explanation:
The Tcl sleep will, like the TclX sleep, just pause the script.
The after command can also pause the script, but it is normally used for event-based programming. It can execute a script after the elapsed time (provided the event loop is running).
For more on this, see here at beedub.com.
Hi, when I write this piece of code:
module memo(out1);
    reg [3:0] mem [2:0];
    output wire [3:0] out1;

    initial
    begin
        mem[0][3:0] = 4'b0000;
        mem[1][3:0] = 4'b1000;
        mem[2][3:0] = 4'b1010;
    end

    assign out1 = mem[1];
endmodule
I get the following warnings, which make the code unsynthesizable:
WARNING:Xst:1780 - Signal mem<2> is never used or assigned. This unconnected signal will be trimmed during the optimization process.
WARNING:Xst:653 - Signal mem<1> is used but never assigned. This sourceless signal will be automatically connected to value 1000.
WARNING:Xst:1780 - Signal mem<0> is never used or assigned. This unconnected signal will be trimmed during the optimization process.
Why am I getting these warnings?
Haven't I assigned the values of mem[0], mem[1] and mem[2]?! Thanks for your help!
Your module has no inputs and a single output, out1. I'm not totally sure what the point of the module is with respect to your larger system, but you're basically initializing mem and then only using mem[1]. You could equivalently have a module which just assigns out1 the value 4'b1000 (mem never changes). So yes, you did initialize the array, but because you didn't use any of the other values, the Xilinx tools are optimizing your module during synthesis and "trimming the fat." If you were to simulate this module (say, in ModelSim) you'd see your initializations just fine. Based on your warnings, though, I'm not sure why you've come to the conclusion that your code is unsynthesizable. It appears to me that you could definitely synthesize it; it's just a rather roundabout way to assign a single value of 4'b1000.
With regard to using initial blocks to store values in block RAM (e.g. to make a ROM), that's fine. I've done that several times without issue. A common use for this is to store coefficients in block RAM, which are read out later. That said, the way this module is written there's no way to read anything else out of mem anyway.
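For illustration, a minimal sketch of a version where the whole array is actually read, so nothing gets trimmed (the module name, the address input, and its width are made up for the example, not part of the original question):

module memo_rom (
    input  wire [1:0] addr,   // made-up address input so mem is actually read
    output wire [3:0] out1
);
    reg [3:0] mem [2:0];

    // Same initialization as the original module.
    initial begin
        mem[0] = 4'b0000;
        mem[1] = 4'b1000;
        mem[2] = 4'b1010;
    end

    assign out1 = mem[addr];  // every element of mem is now reachable
endmodule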
I am completely new to Perl, an absolute newbie. I am trying to develop a system which reads a database and, according to the results, generates a queue which launches another script.
HERE is the source code.
Now the script works as expected, except I have noticed that it doesn't really run the threads in parallel. Whether I use 1 thread or 50 threads, the execution time is the same; 1 thread is even faster.
When I have the script display which thread did what, I see that the threads don't run at the same time: it does thread 1, then 2, then 3, and so on.
Does anyone know what I did wrong here? Again the script itself works, just not in parallel threads.
You need to learn what semaphores actually are before you start using them. You've explicitly told the threads not to run in parallel:
my $s = Thread::Semaphore->new;
# ...
while ($queue_id_list->pending > 0) {
    $s->down;
    my $info = $queue_id_list->dequeue_nb;
    if (defined($info)) {
        my @details = split(/#/, $info);
        #my $result = system("./match_name db=user_".$details[0]." id=".$details[1]);
        # normally the script above would be launched, which is a php script run in php-cli and does some database things
        sleep(0.1);
        #print "Thread: ". threads->self->tid. " - Done user: ".$details[0]. " and addressbook id: ". $details[1]."\r\n";
        #print $queue_id_list->pending."\r\n";
    }
    $s->up;
}
You've created a semaphore $s, which by default has a count of 1. Then in the function you're trying to run, you call $s->down at the start, which decreases the count by 1 (or blocks if the count is already below 1), and $s->up at the end, which increases the count by 1.
Once a thread calls down, no other threads will run until it calls up again.
You should carefully read the Thread::Semaphore docs, and probably this wikipedia article on semaphores, too.
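If the goal is simply to drain the queue from several threads at once, one possible fix (a sketch, assuming the queue is only used as a work list) is to drop the semaphore entirely, since Thread::Queue already serialises access to the queue itself:

use strict;
use warnings;
use threads;
use Thread::Queue;

# Dummy work items for the sketch; your real queue is filled from the database.
my $queue_id_list = Thread::Queue->new(map { "user$_#$_" } 1 .. 100);

sub worker {
    # dequeue_nb is thread-safe on its own, so the threads can genuinely
    # run their per-item work in parallel.
    while (defined(my $info = $queue_id_list->dequeue_nb)) {
        my @details = split /#/, $info;
        # ... launch the external script here, e.g. with system(...) ...
        printf "Thread %d handled user %s, id %s\n",
            threads->self->tid, $details[0], $details[1];
    }
}

my @workers = map { threads->create(\&worker) } 1 .. 4;
$_->join for @workers;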