State definition in Reinforcement learning

State definition in Reinforcement learning - reinforcement-learning

When defining state for a specific problem in reinforcement learning, How to decide what to include and what to leave for the definition, and also how to set difference between an observation and a state.
For example assuming that the agent is in the context of human resource and planning where it needs to hire some workers based on the demand of jobs, considering the cost of hiring them (assuming the budget is limited) is a state in the format of (# workers, cost) a good definition of state?
In total I don't know what information is needed to be in state and what should be left as it's rather observation.
Thank you

I am assuming you are formulating this as an RL problem because the demand is an unknown quantity. And, maybe [this is optional criteria] the Cost of hiring them may take into account a worker's contribution towards the job which is unknown initially. If however, both these quantities are known or can be approximated beforehand then you can just run a Planning algorithm to solve the problem [or just some sort of Optimization].
Having said this, the state in this problem could be something as simple as (#workers). Note I'm not including the cost, because cost must be experienced by the agent, and therefore is unknown to the agent until it reaches a specific state. Depending on the problem, you might need to add another factor of "time", or the "job-remaining".
Most of the theoretical results on RL hinge on a key assumption in several setups that the environment is Markovian. There are several works where you can get by without this assumption, but if you can formulate your environment in a way that exhibits this property, then you would have much more tools to work with. The key idea being, the agent can decide which action to take (in your case, an action could be : Hire 1 more person. Other actions could be Fire a person) based on the current state, say (#workers = 5, time=6). Note that we are not distinguishing between workers yet, so firing "a" person, instead of firing "a specific" person x. If the workers have differing capabilities, you may need to add several other factors each representing which worker is currently hired, and which are currently in the pool, yet to be hired so like a boolean array of a fixed length. (I hope you get the idea of how to form a state representation, and this can vary based on the specifics of the problem, which are missing in your question).
Now, once we have the State definition S, the action definition A (hire / fire), we have the "known" quantities for an MDP-setup in an RL framework. We also need an environment that can supply us with the cost function when we query it (Reward Function / Cost Function), and tell us the outcome of taking a certain action on a certain state (Transition). Note that we don't necessarily need to know these Reward / Transition function beforehand, but we should have a means of getting these values when we query for a specific (state, action).
Coming to your final part, the difference between observation and state. There are much better resources to dig deep into it, but in a crude sense, observation is an agent's (any agent, AI, human etc) sensory data. For example, in your case the agent has the ability to count number of workers currently employed (but it does not have an ability to distinguish between workers).
A state, more formally, a true MDP state must be something that is Markovian and captures the environment at its fundamental level. So, maybe in order to determine the true cost to the company, the agent needs to be able to differentiate between workers, working hours of each worker, jobs they are working at, interactions between workers and so on. Note that, much of these factors may not be relevant to your task, for example a worker's gender. Typically one would like to form a good hypothesis on which factors are relevant beforehand.
Now, even though we can agree that a worker's assignment (to a specific job) maybe a relevant feature which making a decision to hire or fire them, your observation does not have this information. So you have two options, either you can ignore the fact that this information is important and work with what you have available, or you try to infer these features. If your observation is incomplete for the decision making in your formulation we typically classify them as Partially Observable Environments (and use POMDP frameworks for it).
I hope I clarified a few points, however, there is huge theory behind all of this and the question you asked about "coming up with a state definition" is a matter of research. (Much like feature engineering & feature selection in Machine Learning).

Related

Deep Reinforcement Learning, how to make an agent that control many machines

Good morning, Im facing a "RL" problem, which have many constraints, the main idea is that my agent will control many different machines with for example ordering them to go out for doing their missions (we don't give importance for the mission), or ordering them to enter to the depot and choosing for them the right place where they should sit (depending from constraints).
The problem is: the agent will take decision at periods of time that are defined, for each periode we know which of actions (go out, go in) are allowed. He will for example at 8oclock decide to order for 4 machines to go out, and at 14oclock decide to bring back 2 machines(with choosing for them the right place).
In literature i show many ideas which refers to BDQ, but is it recquired for my problem ? Im thinking about having actions like [chooseMachine1, chooseMachine2,chooseMachine3...chooseMachineN, goOut, goInPlace1, goInPlace2, goInPlace3, goInPlace4]. And in the code specifying the logic that depending of the period we are, i expose for the begening a number M<=N of the machines to choose (with giving 0 probability to those actions that aren't possible for the moment' if it is 14oclock you know that only the machines that are out are concerned with the agent decision'), if the agent choose Machine1, so he will access to only the possible actions from choosing it.
So, my question is, do you think that my ideas are right ? (am beginner), my idea is to make a DQN with giving the logic for the possible/impossible actions,
Do you think that a BDQ is more accurate with my problem ? like having N branchs for N machines which have the same possible actions (brach1(Machine1) : go out, goPlace1, goPlace2 ...)
If it is the case is there any implementation examples ?
If you have ressources to advise me, i will be glad of checking them.
Thank You

What would an agent navigating a maze do in case the chosen action would run it into a wall?
I think the usual approach in RL is to allow the move and than handle the result with the environment. In such a way the environment can simply make nothing happen or even give a negative reward when an action is "disallowed".
At training convergence the agent will hopefully learn to not chose ineffective actions.

How to account for rare events at different time intervals while using LSTM neural networks?

I'm working on an interesting sequence-to-sequence (regression) time series problem where some static features/rare events can change the behavior of future time series. The problem is a forecasting problem, where I use previous time step values to forecast the next time step values and I try to integrate static features + rare events into time step t=0.
In my problem, there is always a rare event at t=0 in addition to some static features that should affect the future behavior of time series.
For clarity, My definition of "rare events": an event that happens at a specific time step (for ex: t=0) and another separate event can happen at any time in the future as well (for ex: t=n) in addition to the event that happened at t=0 but, it happens only once at that time and both events can affect the future time series behavior starting from the time they occurred.
Even though most of the static features don't change over time, the rare events can be different from each other (has different characteristics/features). The time of each event is usually known because it will be applied due to outside human intervention to optimize the future behavior (increase profit) but, they do not necessarily happen at the same time step for every sample/example.
These events are so rare that it kind of makes sense to me to treat them as static features at time=0 but, I can't think of a way to include a rare event that happens n timesteps later in the future and has different characteristics than the event at t=0.
Below is an example schematic of the problem. There may be multiple samples with varying time steps affected by these unique rare events but, if I don't account for these events, I believe my predictions may suffer.
Can anyone suggest any sources to look at for these types of problems? I may also be missing key words that are usually used with these types of problems and that may be one of the reasons why I'm still having difficulties finding good sources. I call it "rare events" but, it may be called something else in the literature... At this point, I appreciate any type of source that addresses this issue such as scientific papers/articles, github code or a code example provided by you, correct keywords to search for, etc.
Thank you.
Example image to describe the problem

I have seen your rare event in the picture you have mentioned from the little information present in the picture it could be observed that there is some seasonality present in the rare event . So if you using rare to point out a random event i think that is not correct because it has seasonality (periodic).
In short words you are worried about the features that you using to train the model.
normal events
rare events
There may be multiple samples with varying time steps affected by
these unique rare events but, if I don't account for these events, I
believe my predictions may suffer.
If you are not certain whether your features rare ones you have mentioned are contributing/ or not etc.
You must switch to attention based mechanism because :
" Attention is all you need "
These models such as Bert are much better then LSTM because they add a attention
(Importance) feature to every feature so the model will learn automatically that how much weightage should be added to both rare and normal features.
I am explaining in very general terms as your question was not too much specific.
Have a nice day
stay blessed !

What are examples of real-world scenarios where a message queuing system can accept the loss of some messages?

I was reading this blog post, in which the author proposes the following question, in the context of message queues:
does it matter if a message is lost? If you application node, processing the request, dies, can you recover? You’ll be surprised how often it doesn’t actually matter, and you can function properly without guaranteeing all messages are processed
At first I thought that the main point of handling messages was to never loose a single message - after all, a message lost could mean a hotel reservation not booked, a checkout not completed, or any other functionality not carried through, which seems too similar to a bug for me. I suppose I am missing something, so, what are examples of scenarios where it is OK for a messaging system to loose a few messages?

Well, your initial expectation:
the main point of handling messageswas to never loose a single message
was just not a correct one.
Right, if one strives for a one certain type of robustness, where fail-safe measures have to take all due care and precautions, so as not a single message could get lost, yes, there your a priori expressed expectation fits.
This does not mean that all other system designs have to carry all the immense burdens and have to pay all that incurred costs ( resources-wise, latency-wise et al ), as the "100+% guaranteed delivery" systems do ( but, again, only if they can ).
Anti-pattern cases:
There are many use-cases, where an absolute certainty of delivery of each and every message originally sent is actually an anti-pattern.
Just imagine a weakly synchronised system ( including ones, that have nothing like backthrottling or even any simplest form of feedback propagation at all ), where the sensors read an actual temperature, a sound, a video-frame and send a message with that value(s).
Whenever a postprocessing system gets such information delivered, there may be a reason not to read any and all "old" values, but the most recent one(s).
If a delivery framework already got any newer set of values, all the "older" values, not processed yet, just hanging at some depth from the queue-head, yet in the queue, might create the anti-pattern, where one would not like to have to read and process any and all of those "older" values, but just the most recent one(s).
Like no one will make a trade with you based on yesterday prices, there is not positive value to make any new, current, decision based on reading any and all "old" temperature readings, that still wait in the queue.
Some smart-messaging frameworks provide explicit means for taking just the very "newest" message from a given source - thus enabling to imperatively discard any "older" messages, avoiding them from being read and processed right due to a known presence of a one "most" recent.
This answers the original question about the assumed main point of handling messages.
Efficiency first:
In any case, where a smart-delivery takes place ( either deliver an exact copy of the original message content or noting-at-all ), the resources are used at their best efforts, yet, without spending a single penny on anything but the "just-enough" smart-delivery.
Building robustness costs more than that.
Building an ultimate robustness, costs even way more than that.
Systems than do have such an extreme requirement can and may extend the resources-efficient smart-delivery so as to reach some requirements defined level of robustness, at some add-on costs.
The same but reversed is not possible -- if an "everything-proof" system is to get a slimmer form and fashion, so as to fit onto any restricted-resources hardware or to make it "forget" some "old" messages, that are of no positive value at this very moment ( but on the contrary, constitute a must for the processing element to read and process each and every "unwanted" message, just due to the fact it was delivered, while knowing a core-logic needs just the most recent one ).
Distributed systems accrue E2E-latency from many distributed sources, so any rigid-delivery system just block and penalise the only one element, who is ( latency-wise ) innocent -- the receiver.

I suppose it's OK to loose few messages from some measurement units that deliver the value once in.... Also for big data analytics solutions few lost messages won't make a big difference

It all depends on the application/larger system. The message queue is only one link in the chain, so to speak. If the application(s) at the ends are prepared to deal with loss, losing some messages is not a problem. If the application(s) rely on total messaging integrity then there will be problems.
An example of a system that will be ok with loss is weather updates for your phone. If a few temperature/wind updates don't make it to you there's no real harm in that.
Now, if you're running a nuclear reactor and you lose a few temperature updates on the core, well that is a problem.
I work a lot on safety critical, infrastructure-level systems, and am responsible for messaging much of the time. Many of those systems state clearly that messaging may reorder, duplicate, or lose messages; it's just a fact of life where distributed systems and networks are involved. The endpoint systems need to be designed to work correctly in that environment. So they track messages, ack end to end, deal with duplicates and retransmits, etc.

Modeling an RTS or how Blizzard was able to put together Starcraft 1 & 2?

Not sure if this question is related to software development, but I hope someone can at least point me in the right direction(no, not that direction..)
I am very curious as to how Blizzard achieve such a balance of strategic/tactic forces in their games? If you look at Starcraft 1 or now 2, each race have unique features that sort of counterpart other unique features of other races and all together create a pretty beautiful(to my mind at least) balance.
Is there some sort of area of mathematics that could help model these things? How do they do it basically?

I don't have full answer but here is what I know. Initially, when game is technically ready the balance is not ideal. When they started first public beta there were holes in balance that they patched very fast. They let players (testers) play as is and captured statistics of % of wins per race and tuned the parameters according to it. When the beta was at the end ratio was almost ideal: 33%/33%/33%.

I've no idea how Blizzard specifically did it - it might have just been through a lot of user testing.
However, the entire field of CS and statistics dedicated to these kinds of problems is simulation. Generally the idea is to construct a model of your system and then generate inputs according to some statistical distribution to try to understand the behaviour of that model. In the case of finding game balance, you would want to show that most sequences of game events led to some kind of equilibrium. This would probably involve some kind of stochastic modeling.

Linear algebra, maybe a little calculus, and a lot of testing.
There's no real mystery to this sort of thing, but people often just don't think about it in the rigorous terms you need to get a system that is fairly well-balanced.
Basically the designers know how quickly you can gather resources (both the best case and the average case), and they know how long it takes to build a unit, and they know roughly how powerful a unit is (eg. by reference to approximations such as damage per second). With this, you can ensure a unit's cost in resources makes sense. And from that, it's possible to compare resource gathering with unit cost to model the strength of a force growing over time. Similarly you can measure a force's capacity for damage over time, and compare the two. If there's a big disparity then they can tweak one or more of the variables to reduce it. Obviously there are many permutations of units but the interesting thing is that you only really need to understand the most powerful permutations - if a player picks a poor combination then it's ok if they lose. You don't want every approach to be equally good, as that would imply a boring game where your decisions are meaningless.
This is, of course, a simplification, but it helps get everything in roughly the right place and ensure that there are no units that are useless. Then testing can hammer down the rough edges and find most of the exploits that remain.

If you've tried SCII you've noticed that Blizzard records the relevant data for each game played on B.Net. They did the same in WC3, and presumably in SC1. With enough games stored, it is possible to get accurate results from statistical analysis. For example, the Protoss should win a third of all match-ups with similarly skilled opponents. If this is not the case, Blizzard can analyze the games where the Protoss won vs the games where they lost, and see what units made the difference. Then they can nerf those units (with a bit of in-house testing), and introduce the change at the next patch. Balancing is a difficult problem - a change that fixes problems in top-level games may break the balance in mid-level games.
Proper testing in a closed beta is impossible - you just can't accumulate enough data. This is why Blizzard do open betas. Even so the real test is the release of the game.

How to measure usability to get hard data?

There are a few posts on usability but none of them was useful to me.
I need a quantitative measure of usability of some part of an application.
I need to estimate it in hard numbers to be able to compare it with future versions (for e.g. reporting purposes). The simplest way is to count clicks and keystrokes, but this seems too simple (for example is the cost of filling a text field a simple sum of typing all the letters ? - I guess it is more complicated).
I need some mathematical model for that so I can estimate the numbers.
Does anyone know anything about this?
P.S. I don't need links to resources about designing user interfaces. I already have them. What I need is a mathematical apparatus to measure existing applications interface usability in hard numbers.
Thanks in advance.

http://www.techsmith.com/morae.asp
This is what Microsoft used in part when they spent millions redesigning Office 2007 with the ribbon toolbar.
Here is how Office 2007 was analyzed:
http://cs.winona.edu/CSConference/2007proceedings/caty.pdf
Be sure to check out the references at the end of the PDF too, there's a ton of good stuff there. Look up how Microsoft did Office 2007 (regardless of how you feel about it), they spent a ton of money on this stuff.

Your main ideas to approach in this are Effectiveness and Efficiency (and, in some cases, Efficacy). The basic points to remember are outlined on this webpage.
What you really want to look at doing is 'inspection' methods of measuring usability. These are typically more expensive to set up (both in terms of time, and finance), but can yield significant results if done properly. These methods include things like heuristic evaluation, which is simply comparing the system interface, and the usage of the system interface, with your usability heuristics (though, from what you've said above, this probably isn't what you're after).
More suited to your use, however, will be 'testing' methods, whereby you observe users performing tasks on your system. This is partially related to the point of effectiveness and efficiency, but can include various things, such as the "Think Aloud" concept (which works really well in certain circumstances, depending on the software being tested).
Jakob Nielsen has a decent (short) article on his website. There's another one, but it's more related to how to test in order to be representative, rather than how to perform the testing itself.

Consider measuring the time to perform critical tasks (using a new user and an experienced user) and the number of data entry errors for performing those tasks.

First you want to define goals: for example increasing the percentage of users who can complete a certain set of tasks, and reducing the time they need for it.
Then, get two cameras, a few users (5-10) give them a list of tasks to complete and ask them to think out loud. Half of the users should use the "old" system, the rest should use the new one.
Review the tapes, measure the time it took, measure success rates, discuss endlessly about interpretations.
Alternatively, you can develop a system for bucket-testing -- it works the same way, though it makes it far more difficult to find out something new. On the other hand, it's much cheaper, so you can do many more iterations. Of course that's limited to sites you can open to public testing.
That obviously implies you're trying to get comparative data between two designs. I can't think of a way of expressing usability as a value.

You might want to look into the GOMS model (Goals, Operators, Methods, and Selection rules). It is a very difficult research tool to use in my opinion, but it does provide a "mathematical" basis to measure performance in a strictly controlled environment. It is best used with "expert" users. See this very interesting case study of Project Ernestine for New England Telephone operators.

Measuring usability quantitatively is an extremely hard problem. I tackled this as a part of my doctoral work. The short answer is, yes, you can measure it; no, you can't use the results in a vacuum. You have to understand why something took longer or shorter; simply comparing numbers is worse than useless, because it's misleading.
For comparing alternate interfaces it works okay. In a longitudinal study, where users are bringing their past expertise with version 1 into their use of version 2, it's not going to be as useful. You will also need to take into account time to learn the interface, including time to re-understand the interface if the user's been away from it. Finally, if the task is of variable difficulty (and this is the usual case in the real world) then your numbers will be all over the map unless you have some way to factor out this difficulty.
GOMS (mentioned above) is a good method to use during the design phase to get an intuition about whether interface A is better than B at doing a specific task. However, it only addresses error-free performance by expert users, and only measures low-level task execution time. If the user figures out a more efficient way to do their work that you haven't thought of, you won't have a GOMS estimate for it and will have to draft one up.
Some specific measures that you could look into:
Measuring clock time for a standard task is good if you want to know what takes a long time. However, lab tests generally involve test subjects working much harder and concentrating much more than they do in everyday work, so comparing results from the lab to real users is going to be misleading.
Error rate: how often the user makes mistakes or backtracks. Especially if you notice the same sort of error occurring over and over again.
Appearance of workarounds; if your users are working around a feature, or taking a bunch of steps that you think are dumb, it may be a sign that your interface doesn't give the tools to figure out how to solve their problems.
Don't underestimate simply asking users how well they thought things went. Subjective usability is finicky but can be revealing.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008