Related
I'm a beginner in RL and want to know what is the advantage of having a state value function as well as an action-value function in RL algorithms, for example, Markov Design Process. What is the use of having both of them in prediction and control problems?
I think you mean state-value function and state-action-value function.
Quoting this answer by James MacGlashan:
To explain, lets first add a point of clarity. Value functions
(either V or Q) are always conditional on some policy π. To emphasize
this fact, we often write them as ππ(π ) and ππ(π ,π). In the
case when weβre talking about the value functions conditional on the
optimal policy πβ, we often use the shorthand πβ(π ) and πβ(π ,π).
Sometimes in literature we leave off the π or * and just refer to V
and Q, because itβs implicit in the context, but ultimately, every
value function is always with respect to some policy.
Bearing that in mind, the definition of these functions should clarify
the distinction for you.
ππ(π ) expresses the expected value of following policy π forever
when the agent starts following it from state π .
ππ(π ,π) expresses the expected value of first taking action π
from state π and then following policy π forever.
The main difference then, is the Q-value lets you play a hypothetical
of potentially taking a different action in the first time step than
what the policy might prescribe and then following the policy from the
state the agent winds up in.
For example, suppose in state π Iβm one step away from a terminating
goal state and I get -1 reward for every transition until I reach the
goal. Suppose my policy is the optimal policy so that it always tells
to me walk toward the goal. In this case, ππ(π )=β1 because Iβm just
one step away. However, if I consider the Q-value for an action π
that walks 1 step away from the goal, then ππ(π ,π)=β3 because
first I walk 1 step away (-1), and then I follow the policy which will
now take me two steps to get to the goal: one step to get back to
where I was (-1), and one step to get to the goal (-1), for a total of
-3 reward.
Learner might be in training stage, where it update Q-table for bunch of epoch.
In this stage, Q-table would be updated with gamma(discount rate), learning rate(alpha), and action would be chosen by random action rate.
After some epoch, when reward is getting stable, let me call this "training is done". Then do I have to ignore these parameters(gamma, learning rate, etc) after that?
I mean, in training stage, I got an action from Q-table like this:
if rand_float < rar:
action = rand.randint(0, num_actions - 1)
else:
action = np.argmax(Q[s_prime_as_index])
But after training stage, Do I have to remove rar, which means I have to get an action from Q-table like this?
action = np.argmax(self.Q[s_prime])
Once the value function has converged (values stop changing), you no longer need to run Q-value updates. This means gamma and alpha are no longer relevant, because they only effect updates.
The epsilon parameter is part of the exploration policy (e-greedy) and helps ensure that the agent visits all states infinitely many times in the limit. This is an important factor in ensuring that the agent's value function eventually converges to the correct value. Once we've deemed the value function converged however, there's no need to continue randomly taking actions that our value function doesn't believe to be best; we believe that the value function is optimal, so we extract the optimal policy by greedily choosing what it says is the best action in every state. We can just set epsilon to 0.
Although the answer provided by #Nick Walker is correct, here it's some additional information.
What you are talking about is closely related with the concept technically known as "exploration-exploitation trade-off". From Sutton & Barto book:
The agent has to exploit what it already knows in order to obtain
reward, but it also has to explore in order to make better action
selections in the future. The dilemma is that neither exploration nor
exploitation can be pursued exclusively without failing at the task.
The agent must try a variety of actions and progressively favor those
that appear to be best.
One way to implement the exploration-exploitation trade-off is using epsilon-greedy exploration, that is what you are using in your code sample. So, at the end, once the agent has converged to the optimal policy, the agent must select only those that exploite the current knowledge, i.e., you can forget the rand_float < rar part. Ideally you should decrease the epsilon parameters (rar in your case) with the number of episodes (or steps).
On the other hand, regarding the learning rate, it worths noting that theoretically this parameter should follow the Robbins-Monro conditions:
This means that the learning rate should decrease asymptotically. So, again, once the algorithm has converged you can (or better, you should) safely ignore the learning rate parameter.
In practice, sometimes you can simply maintain a fixed epsilon and alpha parameters until your algorithm converges and then put them as 0 (i.e., ignore them).
I've got this kind of problem with Proton CEP: i currently have a "Sequence" EPA; its input are 2 events. But these events have different granularity: let's say i have A and B events; i receive N "A" events, and M "B" events, where M << N.
So i'd like to have a rule like "if event of type A is not consumed within X seconds, remove it", otherwise i've got a long A events queue; i only need the rule to be evaluated for closest (temporally) events.
Practically, i've got a fake room temperature sensor that sends its temperature updates every 5seconds, and i've got another program that checks external weather and sends it every minute.
Any idea how to solve this situation?
Thank you very much!
I guess that in "consume" you mean arrival, so do you want to evaluate the time the A event took to get to the proton pcoressor? or the time between A events? Do you want to ensure that the A events are indeed continuous in a fix rate? "Removing" an event means to ignore it, since events are not kept anywhere, just processed. At the end, what is that you want to detect here? Like, what is the trend of room temperature compared to the outside temperature? then, emit output events accordingly?
Thanks.
all the relevant event instances are kept within the local state of a corresponding EPA.
For each EPA operand you have policies which dictates how the state is gathered and how the matching set for event derivation is built.
For example, instance selection policy which is defined per operand, and has the values of "Each", "First" and "Last" will tell you if all A instances are examined for match with B instance, or the first (in the order of arrival), or the last.
The consumption policy says what to do with the operand state once a seqence is detected - should the instances of say A which participated in sequence be removed from EPA's state ("consume" value of the policy) or should they remain.
Playing with combination of those policies should give you the behaviour you require
In developing for Fiware's Proton CEP, I came across an issue with Sequence event detection. I'll take advantage of DoSAttack example project, that comes with the software, to explain the issue.
I make two main changes to an original copy of DoSAttack:
-One is to make ExpectedCrash event have 3 more variables. This way I can log to DoSAttackTRConsumer file the 3 values that triggered it.
-Then I also change the Cardinality Policy of the Agent from Single to Unrestricted. This way the event can be triggered several times in a row, as TrafficReports come in (this may be a source to the issue).
I test this result and I find it works ok. I can see in the log that the values that trigger detection are the sequence of 3 values that arrived just before the event, after the first three events have arrived.
This, taking into account that the test beeing made on those 3 values still remains the original example test: (TR3.volume>1.50* TR2.volume AND TR2.volume>1.50 * TR1.volume).
The issue arrises if I make the test be just (TR3.volume>1.50* TR2.volume), for example, then CEP doesn't hold TR1 correctly. Now TR1 is the same as TR2, so cep loses "memory" of this value.
Going a step further, I make the test, just the condition (3>2) which is always true and should trigger a detection on any event that arrives. In this case, as events arrive, all TR1, TR2 and TR3 are the same and CEP has no memory of past values, even though the agent is of Type: Sequence.
The desired application would be for the CEP to recieve 22 readings as a sequence of input events and analyse only the 1st, 8th, 15th and 22nd values of this sequence, at each value that enters. But I find I can't make CEP remember the values correctly unless I'm testing all of them explicitly in the Condition view-box.
What would be the correct way to analyse the 1st, 8th, 15th and 22nd values that arrived, evaluating each time a new one arrives?
Here is the specificatin of DoSAttack, after altering it:
{"epn":{"events":[{"name":"TrafficReport","attributes":[{"name":"volume","type":"Integer","dimension":0}]},{"name":"ExpectedCrash","attributes":[{"name":"Cost","type":"Double","dimension":0},{"name":"TR1","type":"Integer","dimension":"0"},{"name":"TR2","type":"Integer","dimension":"0"},{"name":"TR3","type":"Integer","dimension":"0"}]}],"epas":[{"name":"IncreasingTraffic","epaType":"Sequence","context":"3MinAfterStartUp","inputEvents":[{"name":"TrafficReport","alias":"TR1","consumptionPolicy":"Consume","instanceSelectionPolicy":"First"},{"name":"TrafficReport","alias":"TR2","consumptionPolicy":"Consume","instanceSelectionPolicy":"First"},{"name":"TrafficReport","alias":"TR3","consumptionPolicy":"Consume","instanceSelectionPolicy":"First"}],"computedVariables":[],"assertion":"3>2","evaluationPolicy":"Immediate","cardinalityPolicy":"Unrestricted","internalSegmentation":[],"derivedEvents":[{"name":"ExpectedCrash","reportParticipants":false,"expressions":{"Cost":"10","TR1":"TR1.volume","TR2":"TR2.volume","TR3":"TR3.volume"}}],"derivedActions":[]}],"contexts":{"temporal":[{"name":"3MinAfterStartUp","type":"TemporalInterval","atStartup":true,"neverEnding":false,"initiators":[],"terminators":[{"terminatorType":"RelativeTime","terminationType":"Terminate","relativeTime":"180000"}]}],"segmentation":[],"composite":[]},"consumers":[{"name":"SysTemCrashConsumer","type":"File","properties":[{"name":"filename","value":"/opt/tomcat10/sample/DoSAttack_PredictedCrash.txt"},{"name":"formatter","value":"json"},{"name":"delimiter","value":";"},{"name":"tagDataSeparator","value":"="},{"name":"SendingDelay","value":"1000"}],"events":[{"name":"ExpectedCrash"}],"actions":[]},{"name":"DoSAttackTRConsumer","type":"File","properties":[{"name":"filename","value":"/opt/tomcat10/sample/DoSAttack_TrafficReport.txt"},{"name":"formatter","value":"json"},{"name":"delimiter","value":";"},{"name":"tagDataSeparator","value":"="},{"name":"SendingDelay","value":"1000"}],"events":[{"name":"TrafficReport"}],"actions":[]}],"producers":[{"name":"TrafficReportFileProducer","type":"File","properties":[{"name":"filename","value":"/opt/tomcat10/sample/DoSAttackScenarioJSON.txt"},{"name":"pollingInterval","value":"1000"},{"name":"sendingDelay","value":"1500"},{"name":"formatter","value":"json"},{"name":"delimiter","value":";"},{"name":"tagDataSeparator","value":"="}],"events":[]}],"actions":[],"name":"DoSAttack"}}
The producer file, DoSAttackScenarioJSON.txt, is still the original one, unaltered:
{"Name":"TrafficReport", "volume":"1000"}
{"Name":"TrafficReport", "volume":"1600"}
{"Name":"TrafficReport", "volume":"2500"}
If you do include more values than 3 you can see that the issue propagates.
If you need more information let me know.
Thank you
In the Sequence pattern, the engine looks for event instances that occurred in a particular order.
In Sequence (A, B, C), the engine looks for three event instances, the first one of type A, the second of type B and the third of type C, where:
(A's detection time) <= (B's detection time) AND (B's detection time) <= (C's detection time)
Usually in a Sequence pattern, either the event types are different, or there is other condition above the participants events (as in the DoSAttack example).
When you use the same event type in a sequence (e.g., Sequence(A, A, A)), then the same event instance can be used in all the three places, since it holds the detection order listed above.
In addition, if you use a "consumptionPolicy": "Consume" for a participant event, then after the event was used to detect the pattern, it will not be used for future detections of this pattern.
This is why when you have a Sequence(A, A, A) with no condition, and event instance A1 of type A arrives, it causes a pattern detection, and since it has Consume policy, it will not be kept for future detections. Later when event A2 of type A arrives, it causes another detection based on A2 alone.
Also, according to the Sequence built-in condition over the detection time, a sequence of events can be detected although other events arrived in between.
Please describe the pattern you would like to detect. Maybe you can use a Trend or Aggregate EPA instead.
I need to make an application for creating logic circuits and seeing the results. This is primarily for use in A-Level (UK, 16-18 year olds generally) computing courses.
Ive never made any applications like this, so am not sure on the best design for storing the circuit and evaluating the results (at a resomable speed, say 100Hz on a 1.6Ghz single core computer).
Rather than have the circuit built from the basic gates (and, or, nand, etc) I want to allow these gates to be used to make "chips" which can then be used within other circuits (eg you might want to make a 8bit register chip, or a 16bit adder).
The problem is that the number of gates increases massively with such circuits, such that if the simulation worked on each individual gate it would have 1000's of gates to simulate, so I need to simplify these components that can be placed in a circuit so they can be simulated quickly.
I thought about generating a truth table for each component, then simulation could use a lookup table to find the outputs for a given input. The problem occurred to me though that the size of such tables increase massively with inputs. If a chip had 32 inputs, then the truth table needs 2^32 rows. This uses a massive amount of memory in many cases more than there is to use so isn't practical for non-trivial components, it also wont work with chips that can store their state (eg registers) since they cant be represented as a simply table of inputs and outputs.
I know I could just hardcode things like register chips, however since this is for educational purposes I want it so that people can make their own components as well as view and edit the implementations for standard ones. I considered allowing such components to be created and edited using code (eg dlls or a scripting language), so that an adder for example could be represented as "output = inputA + inputB" however that assumes that the students have done enough programming in the given language to be able to understand and write such plugins to mimic the results of their circuit which is likly to not be the case...
Is there some other way to take a boolean logic circuit and simplify it automatically so that the simulation can determine the outputs of a component quickly?
As for storing the components I was thinking of storing some kind of tree structure, such that each component is evaluated once all components that link to its inputs are evaluated.
eg consider: A.B + C
The simulator would first evaluate the AND gate, and then evaluate the OR gate using the output of the AND gate and C.
However it just occurred to me that in cases where the outputs link back round to the inputs, will cause a deadlock because there inputs will never all be evaluated...How can I overcome this, since the program can only evaluate one gate at a time?
Have you looked at Richard Bowles's simulator?
You're not the first person to want to build their own circuit simulator ;-).
My suggestion is to settle on a minimal set of primitives. When I began mine (which I plan to resume one of these days...) I had two primitives:
Source: zero inputs, one output that's always 1.
Transistor: two inputs A and B, one output that's A and not B.
Obviously I'm misusing the terminology a bit, not to mention neglecting the niceties of electronics. On the second point I recommend abstracting to wires that carry 1s and 0s like I did. I had a lot of fun drawing diagrams of gates and adders from these. When you can assemble them into circuits and draw a box round the set (with inputs and outputs) you can start building bigger things like multipliers.
If you want anything with loops you need to incorporate some kind of delay -- so each component needs to store the state of its outputs. On every cycle you update all the new states from the current states of the upstream components.
Edit Regarding your concerns on scalability, how about defaulting to the first principles method of simulating each component in terms of its state and upstream neighbours, but provide ways of optimising subcircuits:
If you have a subcircuit S with inputs A[m] with m < 8 (say, giving a maximum of 256 rows) and outputs B[n] and no loops, generate the truth table for S and use that. This could be done automatically for identified subcircuits (and reused if the subcircuit appears more than once) or by choice.
If you have a subcircuit with loops, you may still be able to generate a truth table. There are fixed-point finding methods which can help here.
If your subcircuit has delays (and they are significant to the enclosing circuit) the truth table can incorporate state columns. E.g. if the subcircuit has input A, inner state B, and output C, where C <- A and B, B <- A, the truth table could be:
A B | B C
0 0 | 0 0
0 1 | 0 0
1 0 | 1 0
1 1 | 1 1
If you have a subcircuit that the user asserts implements a particular known pattern such as "adder", provide an option for using a hard-coded implementation for updating that subcircuit instead of by simulating its inner parts.
When I made a circuit emulator (sadly, also incomplete and also unreleased), here's how I handled loops:
Each circuit element stores its boolean value
When an element "E0" changes its value, it notifies (via the observer pattern) all who depend on it
Each observing element evaluates its new value and does likewise
When the E0 change occurs, a level-1 list is kept of all elements affected. If an element already appears on this list, it gets remembered in a new level-2 list but doesn't continue to notify its observers. When the sequence which E0 began has stopped notifying new elements, the next queue level is handled. Ie: the sequence is followed and completed for the first element added to level-2, then the next added to level-2, etc. until all of level-x is complete, then you move to level-(x+1)
This is in no way complete. If you ever have multiple oscillators doing infinite loops, then no matter what order you take them in, one could prevent the other from ever getting its turn. My next goal was to alleviate this by limiting steps with clock-based sync'ing instead of cascading combinatorials, but I never got this far in my project.
You might want to take a look at the From Nand To Tetris in 12 steps course software. There is a video talking about it on youtube.
The course page is at: http://www1.idc.ac.il/tecs/
If you can disallow loops (outputs linking back to inputs), then you can significantly simplify the problem. In that case, for every input there will be exactly one definite output. Cycles however can make the output undecideable (or rather, constantly changing).
Evaluating a circuit without loops should be easy - just use the BFS algorithm with "junctions" (connections between logic gates) as the items in the list. Start off with all the inputs to all the gates in an "undefined" state. As soon as a gate has all inputs "defined" (either 1 or 0), calculate its output and add its output junctions to the BFS list. This way you only have to evaluate each gate and each junction once.
If there are loops, the same algorithm can be used, but the circuit can be built in such a way that it never comes to a "rest" and some junctions are always changing between 1 and 0.
OOps, actually, this algorithm can't be used in this case because the looped gates (and gates depending on them) would forever stay as "undefined".
You could introduce them to the concept of Karnaugh maps, which would help them simplify truth values for themselves.
You could hard code all the common ones. Then allow them to build their own out of the hard coded ones (which would include low level gates), which would be evaluated by evaluating each sub-component. Finally, if one of their "chips" has less than X inputs/outputs, you could "optimize" it into a lookup table. Maybe detect how common it is and only do this for the most used Y chips? This way you have a good speed/space tradeoff.
You could always JIT compile the circuits...
As I haven't really thought about it, I'm not really sure what approach I'd take.. but it would possibly be a hybrid method and I'd definitely hard code popular "chips" in too.
When I was playing around making a "digital circuit" simulation environment, I had each defined circuit (a basic gate, a mux, a demux and a couple of other primitives) associated with a transfer function (that is, a function that computes all outputs, based on the present inputs), an "agenda" structure (basically a linked list of "when to activate a specific transfer function), virtual wires and a global clock.
I arbitrarily set the wires to hard-modify the inputs whenever the output changed and the act of changing an input on any circuit to schedule a transfer function to be called after the gate delay. With this at hand, I could accommodate both clocked and unclocked circuit elements (a clocked element is set to have its transfer function run at "next clock transition, plus gate delay", any unclocked element just depends on the gate delay).
Never really got around to build a GUI for it, so I've never released the code.