Variable actions in Deep Reinforcement Learning - reinforcement-learning

I'm trying to teach an AI the combat mechanics of a system similar to Darkest Dungeon.
The goal is for the AI to act well when controlling NPCs with random stats and random skills. This means that in each session the AI's character will have different values for health, stress, accuracy, dodge, etc. The stats of each skill the character has are also random: the damage, accuracy, and effects.
In my current system, the inputs for the model are:
All the stats of the AI's character.
All the stats of its allies.
A subset of the stats of each enemy (in Darkest Dungeon you cannot see all of the enemies' stats).
All the stats of each skill.
The outputs for the model are:
Which skill to use (out of 4 options).
Which target (8 options in total: any ally, including self, and any enemy).
I'm using an action mask to disable invalid actions (such as using an offensive skill on an ally or targeting a position that has no character on).
The main problem I'm having is that what each action does changes heavily depending on the stats of the skill in that index.
Does anyone have any insight into which kind of learning I'm looking for? So far I have tried MA-POCA, provided by the Unity ML-Agents package, with no success: the model didn't seem to understand that what each action does depends heavily on the associated skill stats.
Searching for papers on the subject only turned up articles about action spaces of variable size, which I already handle by masking invalid actions.
Obs: I'm not limited to training in the Unity environment. The only limitation I have is that the model must be convertible/exported to ONNX format.

Related

Is the reward related to previous state or next state?

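In the reinforcement learning framework, I am a little bit confused about the reward and how it is related to states. For example, in Q-learning, we have the following formula for updating the Q table:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$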
That means that the reward is obtained from the environment at time $t+1$. I mean that after applying the action $a_t$, the environment gives $s_{t+1}$ and $r_{t+1}$.
Yet the reward is often associated with the previous time step, that is, $r_t$ is used in the above formula. See, for example, the Wikipedia page for Q-learning (https://en.wikipedia.org/wiki/Q-learning). Why is this?
Incidentally, some Wikipedia pages about the same topic but in different languages use $r_{t+1}$ (or, unexpectedly, $R_{t+1}$). See, for example, the Italian and Japanese pages:
https://it.wikipedia.org/wiki/Q-learning
https://ja.wikipedia.org/wiki/Q%E5%AD%A6%E7%BF%92

Which reinforcement learning algorithm is applicable to a problem with a continuously variable reward and no intermediate rewards?

I think the title says it. A "game" takes a number of moves to complete, at which point a total score is computed. The goal is to maximize this score, and there are no rewards provided for specific moves during the game. Is there an existing algorithm that is geared toward this type of problem?
EDIT: By "continuously variable" reward, I mean it is a floating point number, not a win/loss binary. So you can't, for example, respond to "winning" by reinforcing the moves made to get there. All you have is a number. You can rank different runs in order of preference, but a single result is not especially meaningful.
First of all, in my opinion, the title of your question seems a little confusing when you talk about "continuously variable reward". Maybe you could clarify this aspect.
On the other hand, leaving the previous point aside, it looks like you are talking about the temporal credit-assignment problem: how do you distribute credit across a sequence of actions that only obtains a reward (positive or negative) at the end of the sequence?
E.g., a Tic-tac-toe game where the agent doesn't receive any reward until the game ends. Almost any RL algorithm tries to solve the temporal credit-assignment problem in this setting. See, for example, Section 1.5 of the Sutton and Barto RL book, where they explain the working principles of RL and its advantages over other approaches using a Tic-tac-toe game as an example.
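To make the credit-assignment idea concrete, here is a minimal sketch of how a single end-of-game score can be spread back over the earlier moves as a discounted return. It is not tied to any particular RL library; the function name `discounted_returns`, the discount factor `gamma`, and the example rewards are all illustrative assumptions rather than anything from the question.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step sees the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical episode: no intermediate rewards, one floating-point score at the end.
episode_rewards = [0.0, 0.0, 0.0, 7.5]
print(discounted_returns(episode_rewards, gamma=0.5))
```

Every move receives a share of the final score, with earlier moves discounted more heavily; methods such as Monte Carlo policy gradients use returns of this kind as the learning signal when no per-move reward exists.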

Building an autonomic drugs widget for medical education

I've made my way over to this community because I'm planning on building a widget to help medical students with understanding the effects of various autonomic medications on cardiovascular metrics like heart rate (HR), BP (systolic, diastolic, and mean) and peripheral resistance (SVR). Some background - I'm a 3rd year med student in the US without fluency in any programming languages (which makes this particularly difficult), but am willing to spend the time to pick up what I need to know to make this happen.
Regarding the project:
The effects of autonomic medications like epinephrine, norepinephrine, beta-blockers, and alpha-blockers on the cardiovascular system are of great interest to physicians because these drugs can be used to resuscitate, to prep for surgery, to slow the progression of cardiovascular and respiratory disease, and even as antidotes for certain toxicities. There are four receptor types we are primarily concerned with - alpha1, alpha2, beta1, beta2. The receptor selectivity profile of any given drug is what governs its effects on the CV system. The way these effects are taught and tested in med school classrooms and by the United States board exams is in the form of graphs.
The impetus for this project is that many of my classmates and I struggled with this concept when we were initially learning it, and I believe a large part of that arises from the lack of a resource which shows the changes in the graphs from baseline, in real time.
When being taught this info, we are required to consider: a) the downstream effects when the receptor types listed above are stimulated (by an agonist drug) or inhibited (by an antagonist); b) the receptor specificities of each of the autonomic drugs of interest (there are about 8 that are very important); c) how to interpret the graphs shown above and how those graphs would change if multiple autonomics were administered in succession. (Exams and the boards love to show graphs with various points marked along them, then ask which drugs are responsible for the changes seen, just like the example above.)
The current methods of learning these three points are a mess, and having gone through it, I'd like to do what I can to contribute to building a more effective resource.
My goal is to create a widget that allows a user to visualize these changes with up to 3 drugs in succession. Here is a rough sketch of the goal.
In this example, norepinephrine has strong alpha1 agonist effects which causes an increase in systolic (blue line), diastolic (red line), and mean BP, as well as peripheral resistance. Due to the increased BP, there is a reflexive decrease in HR.
Upon the administration of phentolamine, a strong alpha1 antagonist, the BP and SVR decline while HR increases reflexively.
Regarding the widget, I would like the user to be able to choose up to 3 drugs from a drop down menu (eg. Drug 1, Drug 2, Drug 3), and the graphs to reflect the effects of those drugs on the CV metrics while ALSO taking into account the interactions of the drugs with themselves.
This is an IMPORTANT point: the order in which drugs are added is important because certain receptors become blocked, preventing other drugs from having their primary effect so they revert to their secondary effect.
If you're still following me on this, what I'm looking for is some help in figuring out how best to approach all the possibilities that can happen. Should I try to understand if-then statements and write a script to produce graphs based off those? (eg. if epi, then Psys = x, Pdia = y, MAP = z). Should I create a contingency table in excel in which I list the 8 drugs I'm focusing on and make values for the metrics and then plot those, essentially taking into account all the permutations? Any thoughts and direction would be greatly appreciated.
Thank you for your time.

How do I evaluate a text summarization tool?

I have written a system that summarizes a long document containing thousands of words. Are there any norms on how such a system should be evaluated in the context of a user survey?
In short, is there a metric for evaluating the time that my tool has saved a human? Currently, I was thinking of using the (Time taken to read the original document/Time taken to read the summary) as a way of determining the time saved, but are there better metrics?
Currently, I am asking the user subjective questions about the accuracy of the summary.
In general:
Bleu measures precision: how much the words (and/or n-grams) in the machine generated summaries appeared in the human reference summaries.
Rouge measures recall: how much the words (and/or n-grams) in the human reference summaries appeared in the machine generated summaries.
Naturally, these results are complementary, as is often the case with precision vs recall. If many words/n-grams from the system results appear in the human references you will have high Bleu, and if many words/n-grams from the human references appear in the system results you will have high Rouge.
There's something called brevity penalty, which is quite important and has already been added to standard Bleu implementations. It penalizes system results which are shorter than the general length of a reference (read more about it here). This complements the n-gram metric behavior which in effect penalizes longer than reference results, since the denominator grows the longer the system result is.
You could also implement something similar for Rouge, but this time penalizing system results which are longer than the general reference length, which would otherwise enable them to obtain artificially higher Rouge scores (since the longer the result, the higher the chance you would hit some word appearing in the references). In Rouge we divide by the length of the human references, so we would need an additional penalty for longer system results which could artificially raise their Rouge score.
Finally, you could use the F1 measure to make the metrics work together: F1 = 2 * (Bleu * Rouge) / (Bleu + Rouge)
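As a rough illustration of the precision/recall split described above, the sketch below counts clipped unigram overlap between a candidate and a reference and combines the two with F1. It is a simplification and not the official Bleu or Rouge: there is no brevity penalty, no higher-order n-grams, and only a single reference. The function name `overlap_scores` and the example sentences are assumptions made for the illustration.

```python
from collections import Counter


def overlap_scores(candidate, reference):
    """Return (precision, recall, f1) based on clipped unigram overlap."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter & Counter keeps the minimum count per word (clipping).
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)  # Bleu-like view
    recall = overlap / max(sum(ref_counts.values()), 1)      # Rouge-like view
    if overlap == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = overlap_scores("the cat sat", "the cat sat on the mat")
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

A short candidate that only copies reference words gets perfect precision but low recall, which is the behaviour the brevity penalty is meant to counteract.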
BLEU
BLEU measures precision.
Bilingual Evaluation Understudy.
Originally designed for machine translation (hence "bilingual").
W(machine-generated summary) in W(human reference summary).
That is, how many of the words (and/or n-grams) in the machine-generated summaries appear in the human reference summaries.
The closer a machine translation is to a professional human translation, the better it is.
ROUGE
ROUGE measures recall.
Recall-Oriented Understudy for Gisting Evaluation.
W(human reference summary) in W(machine-generated summary).
That is, how many of the words (and/or n-grams) in the human reference summaries appear in the machine-generated summaries.
Overlap of n-grams between the system and reference summaries.
ROUGE-N, where N is the n-gram length.
reference_text = """Artificial intelligence (AI, also machine intelligence, MI) is intelligence demonstrated by machines, in contrast to the natural intelligence (NI) displayed by humans and other animals. In computer science AI research is defined as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving". See glossary of artificial intelligence. The scope of AI is disputed: as machines become increasingly capable, tasks considered as requiring "intelligence" are often removed from the definition, a phenomenon known as the AI effect, leading to the quip "AI is whatever hasn't been done yet." For instance, optical character recognition is frequently excluded from "artificial intelligence", having become a routine technology. Capabilities generally classified as AI as of 2017 include successfully understanding human speech, competing at a high level in strategic game systems (such as chess and Go), autonomous cars, intelligent routing in content delivery networks, military simulations, and interpreting complex data, including images and videos. Artificial intelligence was founded as an academic discipline in 1956, and in the years since has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success and renewed funding. For most of its history, AI research has been divided into subfields that often fail to communicate with each other. These sub-fields are based on technical considerations, such as particular goals (e.g. "robotics" or "machine learning"), the use of particular tools ("logic" or "neural networks"), or deep philosophical differences. 
Subfields have also been based on social factors (particular institutions or the work of particular researchers). The traditional problems (or goals) of AI research include reasoning, knowledge, planning, learning, natural language processing, perception and the ability to move and manipulate objects. General intelligence is among the field's long-term goals. Approaches include statistical methods, computational intelligence, and traditional symbolic AI. Many tools are used in AI, including versions of search and mathematical optimization, neural networks and methods based on statistics, probability and economics. The AI field draws upon computer science, mathematics, psychology, linguistics, philosophy and many others. The field was founded on the claim that human intelligence "can be so precisely described that a machine can be made to simulate it". This raises philosophical arguments about the nature of the mind and the ethics of creating artificial beings endowed with human-like intelligence, issues which have been explored by myth, fiction and philosophy since antiquity. Some people also consider AI to be a danger to humanity if it progresses unabatedly. Others believe that AI, unlike previous technological revolutions, will create a risk of mass unemployment. In the twenty-first century, AI techniques have experienced a resurgence following concurrent advances in computer power, large amounts of data, and theoretical understanding; and AI techniques have become an essential part of the technology industry, helping to solve many challenging problems in computer science."""
Abstractive summarization
# Abstractive Summarize
len(reference_text.split())
from transformers import pipeline
summarization = pipeline("summarization")
abstractve_summarization = summarization(reference_text)[0]["summary_text"]
Abstractive Output
In computer science AI research is defined as the study of "intelligent agents" Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving" Capabilities generally classified as AI as of 2017 include successfully understanding human speech, competing at a high level in strategic game systems (such as chess and Go)
Extractive summarization
# Extractive summarize
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
parser = PlaintextParser.from_string(reference_text, Tokenizer("english"))
# parser.document.sentences
summarizer = LexRankSummarizer()
extractve_summarization = summarizer(parser.document,2)
# Join the selected sentences into a single string
extractve_summarization = ' '.join(str(s) for s in extractve_summarization)
Extractive Output
Colloquially, the term "artificial intelligence" is often used to describe machines that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving".As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect. Sub-fields have also been based on social factors (particular institutions or the work of particular researchers).The traditional problems (or goals) of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception and the ability to move and manipulate objects.
Using Rouge to Evaluate abstractive Summary
from rouge import Rouge
r = Rouge()
r.get_scores(abstractve_summarization, reference_text)
Using Rouge Abstractive summary output
[{'rouge-1': {'f': 0.22299651364421083,
'p': 0.9696969696969697,
'r': 0.12598425196850394},
'rouge-2': {'f': 0.21328671127225052,
'p': 0.9384615384615385,
'r': 0.1203155818540434},
'rouge-l': {'f': 0.29041095634452996,
'p': 0.9636363636363636,
'r': 0.17096774193548386}}]
Using Rouge to Evaluate extractive Summary
from rouge import Rouge
r = Rouge()
r.get_scores(extractve_summarization, reference_text)
Using Rouge Extractive summary output
[{'rouge-1': {'f': 0.27860696251962963,
'p': 0.8842105263157894,
'r': 0.16535433070866143},
'rouge-2': {'f': 0.22296172781038814,
'p': 0.7127659574468085,
'r': 0.13214990138067062},
'rouge-l': {'f': 0.354755780824869,
'p': 0.8734177215189873,
'r': 0.22258064516129034}}]
Interpreting rouge scores
ROUGE is a score of overlapping words. ROUGE-N refers to overlapping n-grams. Specifically:
$$\text{ROUGE-N} = \frac{\sum_{r} \sum_{s} \text{match}(\text{gram}_{s,c})}{\sum_{r} \sum_{s} \text{count}(\text{gram}_{s})}$$
I tried to simplify the notation compared with the original paper. Let's assume we are calculating ROUGE-2, i.e. bigram matches. The inner sum $\sum_s$ in the numerator loops through all bigrams in a single reference summary and counts the number of times a matching bigram is found in the candidate summary (proposed by the summarization algorithm). If there is more than one reference summary, $\sum_r$ ensures we repeat the process over all reference summaries.
The denominator simply counts the total number of bigrams in all reference summaries. This is the process for one document-summary pair. You repeat the process for all documents, and average all the scores and that gives you a ROUGE-N score. So a higher score would mean that on average there is a high overlap of n-grams between your summaries and the references.
Example:
S1. police killed the gunman
S2. police kill the gunman
S3. the gunman kill police
S1 is the reference and S2 and S3 are candidates. Note S2 and S3 both have one overlapping bigram with the reference, so they have the same ROUGE-2 score, although S2 should be better. An additional ROUGE-L score deals with this, where L stands for Longest Common Subsequence. In S2, the first word and last two words match the reference, so it scores 3/4, whereas S3 only matches the bigram, so scores 2/4.
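The S1/S2/S3 example can be checked with a few lines of Python. This is a minimal sketch with whitespace tokenization and a single reference, not the official ROUGE script; the helper names `rouge_n_recall` and `lcs_length` are made up for the illustration, and ROUGE-L is shown in its simplest recall form (LCS length divided by reference length).

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(candidate, reference, n=2):
    """Matching n-grams divided by the number of n-grams in the reference."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(ref.values())
    return sum((cand & ref).values()) / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


s1 = "police killed the gunman"   # reference
s2 = "police kill the gunman"     # candidate
s3 = "the gunman kill police"     # candidate

for name, cand in (("S2", s2), ("S3", s3)):
    rouge_2 = rouge_n_recall(cand, s1, n=2)
    rouge_l = lcs_length(cand.split(), s1.split()) / len(s1.split())
    print(name, "ROUGE-2:", round(rouge_2, 3), "ROUGE-L:", rouge_l)
```

Both candidates share exactly one bigram with the reference, so ROUGE-2 cannot separate them, while the LCS-based score gives 3/4 for S2 and 2/4 for S3.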
Historically, summarization systems have often been evaluated by comparing to human-generated reference summaries. In some cases, the human summarizer constructs a summary by selecting relevant sentences from the original document; in others, the summaries are hand-written from scratch.
Those two techniques are analogous to the two major categories of automatic summarization systems - extractive vs. abstractive (more details available on Wikipedia).
One standard tool is Rouge, a script (or a set of scripts; I can't remember offhand) that computes n-gram overlap between the automatic summary and a reference summary. Rouge can optionally compute overlap allowing word insertions or deletions between the two summaries (e.g. if allowing a 2-word skip, 'installed pumps' would be credited as a match to 'installed defective flood-control pumps').
My understanding is that Rouge's n-gram overlap scores were fairly well correlated with human evaluation of summaries up to some level of accuracy, but that the relationship may break down as summarization quality improves. I.e., that beyond some quality threshold, summaries that are judged better by human evaluators may be scored similarly to - or outscored by - summaries judged inferior. Nevertheless, Rouge scores might be a helpful first cut at comparing 2 candidate summarization systems, or a way to automate regression testing and weed out serious regressions before passing a system on to human evaluators.
Your approach of collecting human judgements is probably the best evaluation, if you're able to afford the time / monetary cost. To add a little rigor to that process, you might look at the scoring criteria used in recent summarization tasks (see the various conferences mentioned by @John Lehmann). The scoresheets used by those evaluators might help guide your own evaluation.
I'm not sure about the time evaluation, but regarding accuracy you might consult literature under the topic Automatic Document Summarization. The primary evaluation was the Document Understanding Conference (DUC) until the Summarization task was moved into Text Analysis Conference (TAC) in 2008. Most of these focus on advanced summarization topics such as multi-document, multi-lingual, and update summaries.
You can find the evaluation guidelines for each of these events posted online. For single document summarization tasks look at DUC 2002-2004.
Or, you might consult the ADS evaluation section in Wikipedia.
There is also the very recent BERTScore metric (arXiv'19, ICLR'20, already almost 90 citations) that does not suffer from the well-known issues of ROUGE and BLEU.
Abstract from the paper:
We propose BERTScore, an automatic evaluation metric for text
generation. Analogously to common metrics, BERTScore computes a
similarity score for each token in the candidate sentence with each
token in the reference sentence. However, instead of exact matches, we
compute token similarity using contextual embeddings. We evaluate
using the outputs of 363 machine translation and image captioning
systems. BERTScore correlates better with human judgments and provides
stronger model selection performance than existing metrics. Finally,
we use an adversarial paraphrase detection task to show that BERTScore
is more robust to challenging examples when compared to existing
metrics.
Paper: https://arxiv.org/pdf/1904.09675.pdf
Code: https://github.com/Tiiiger/bert_score
Full reference:
Zhang, Tianyi, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. "Bertscore: Evaluating text generation with bert." arXiv preprint arXiv:1904.09675 (2019).
There are many measures against which you can evaluate your summarization system, such as:
Precision = Number of important sentences in the summary / Total number of sentences in the summary.
Recall = Number of important sentences retrieved / Total number of important sentences present.
F-score = 2 * (Precision * Recall) / (Precision + Recall)
Compression rate = Total number of words in the summary / Total number of words in the original document.
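The four quantities above can be computed directly once a human has marked which sentences of the document are important. The following is a minimal sketch under that assumption; the function name `summary_metrics`, the use of sentence indices, and the example numbers are hypothetical.

```python
def summary_metrics(selected, important, summary_words, document_words):
    """Sentence-level precision, recall, F-score, plus the compression rate.

    selected:  indices of the sentences the system put in the summary
    important: indices of the sentences a human judged important
    """
    selected, important = set(selected), set(important)
    hits = len(selected & important)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(important) if important else 0.0
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    compression = summary_words / document_words
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "compression_rate": compression,
    }


# Hypothetical run: 4 sentences selected, 3 judged important, 2 in common.
print(summary_metrics([0, 2, 5, 7], [0, 2, 3], summary_words=50, document_words=500))
```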
When you are evaluating an automatic summarisation system you would typically look at the content of the summary rather than time.
Your idea of:
(Time taken to read the original document/Time taken to read the summary)
doesn't tell you much about your summarisation system; it really only gives you an idea of the compression rate of your system (i.e. the summary is 10% of the original document).
You may want to consider the time it takes your system to summarise a document vs. the time it would take a human (system: 2s, human: 10 mins).
I recommend BARTScore. Check the GitHub page and the article. The authors also released a meta-evaluation on the ExplainaBoard platform, "which allows to interactively understand the strengths, weaknesses, and complementarity of each metric". You can find a list of most of the state-of-the-art metrics there.
As a quick summary of the available metrics, I wrote a post describing the evaluation metrics: what kinds of metrics do we have, how do they differ from human evaluation, etc. You can read the blog post Evaluation Metrics: Assessing the quality of NLG outputs.
Also, along with our NLP projects, we created and publicly released an evaluation package, Jury, which is still actively maintained; you can see the reasons why we created such a package in the repo. There are several packages for carrying out evaluation in NLP (some of them are specialized in a specific NLP task):
jury
datasets
sacrebleu (Machine Translation)
SummEval (Summarization)

AI for a Final fantasy tactics-like game

I am implementing a small grid-based, turn-based strategy game along the lines of Final Fantasy Tactics.
Do you have any ideas on how I can approach the target selection, movement and skill selection process?
I am considering making the decisions independent, but all three are tightly coupled.
(e.g. I can't decide where to move unless I know who I am going to attack and what range the skill I will use has, and vice versa: I can't decide who to attack unless I know how many turns it will take me to reach each target)
I want to move towards a unified system, but trying out ideas from potential field research, used in a manner like the Killzone 1 AI, has me getting stuck on local maxima.
=== Update 1
I am currently trying to use potential fields / influence maps to generate the data I base decisions on.
I have no idea how to handle having many skills, and skills that don't do damage but rather buff/debuff or alter the world.
Someone elsewhere suggested using Monte Carlo Tree Search, which is currently used in Go programs.
I believe the space my actors will be using is not well suited to it, as many moves in the game don't result in a position from which you can attack and affect the world (my world is bigger than the ones in Final Fantasy Tactics).
In Final Fantasy Tactics it might be applied successfully, although the branching factor is much bigger than that of 9x9 Go (from what I understand).
===
Thanks in advance, Xtapodi.
P.S. 1 - A problem is that to know accurately how far away an enemy is, I would need to pathfind to him, because although the enemy is near, an impassable cliff that takes 4 turns to go around might be separating us. Or worse, a unit is blocking the way on, let's say, a bridge, so there is actually no way to reach him.
One approach I've used is to do a two-pass system.
First, find out where your unit can go. Use A* or whatever to flag out the terrain to see how far the unit can move this turn.
Once you know that, step through your available tactics (melee attack, heal friendly unit, whatever), and compute a fitness value for each available use of the tactic. If you pass in the flagged terrain, you can very quickly determine what your space of possible tactics is.
This gives you a list of available tactics and their fitness functions for each move. Select the best one or randomize from the top. If there aren't any tactics available, repeat the process with flagging the terrain for two moves, and so on.
What I mean by fitness function is to decide on the "value" of performing the tactic on a certain unit or location. For instance, your "heal a friendly unit" tactical decision phase might step through all friendly units. If a friendly unit is within range (i.e., is reachable from a location your unit can reach), add it to the list of possible tactics and give it a fitness rating equal to, say, 100 * (1.0 - unit health), where unit health ranges from 0 to 1. Thus, healing a character down to only 10% health remaining would be worth 90 points, while a unit only down 5% would only be worth 5, and the unit wouldn't even consider healing an undamaged unit. Special units (i.e., "protect the boss" scenario units required to retain victory conditions) could be given a higher base number, so that they are given more attention by friendly units.
Similarly, your "melee attack" decision phase would step through all reachable enemy units, compute the likely damage, and compare that to the unit's health. Give each unit a "desirability" to attack, and multiply it by the percentage of remaining health you'd likely do, and you've got a pretty detailed fitness function that favors eliminating units when you can, but still goes after high-value targets.
Using a process like this, you'll get a list of options like "Move to location A and heal friendly unit B : 50 points", "Move to location C and attack hostile unit D : 15 points", etc. Suddenly, it's really easy to choose a tactic.
Further detail may be added by multiplying the fitness of the tactic by a fitness for the path you'd have to take to implement it. For instance, if the place you'd have to move to in order to heal a friendly unit puts you in severe danger (i.e., standing on a lava space or something), you might factor that in by multiplying the fitness of that tactic by .2 or so, so that the unit may still consider it, but only if it's really important. All this takes is writing an algorithm to assess the fitness of a given location, and could be as simple as a pre-computed "terrain desirability" number or as complex as maintaining "threat maps" of enemy units.
The hard part, of course, is finding the right measures to make the engine smart. But that's the fun part of your system to tweak.
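A minimal Python sketch of the scoring pass described above might look like the following. The `Option` class, the function names, and all the numbers (desirability 30, danger multiplier 0.2, and so on) are illustrative assumptions; only the 100 * (1.0 - unit health) heal formula and the idea of multiplying by a path fitness come from the description.

```python
from dataclasses import dataclass


@dataclass
class Option:
    description: str
    fitness: float


def heal_fitness(health, base=100.0):
    """health is in [0, 1]; an undamaged unit scores 0."""
    return base * (1.0 - health)


def attack_fitness(desirability, expected_damage, target_health):
    """Desirability scaled by the fraction of remaining health removed (capped at 1)."""
    if target_health <= 0:
        return 0.0
    return desirability * min(expected_damage / target_health, 1.0)


def best_option(options):
    """Pick the highest-scoring tactic; ties go to the first one listed."""
    return max(options, key=lambda option: option.fitness)


options = [
    Option("Move to location A and heal friendly unit B", heal_fitness(0.5)),
    Option("Move to location C and attack hostile unit D", attack_fitness(30, 5, 10)),
]
print(best_option(options).description)  # heal: 50 points vs attack: 15 points

# If location A is dangerous (say, a lava tile), scale that tactic by a path fitness.
options[0].fitness *= 0.2
print(best_option(options).description)  # heal drops to 10, so the attack wins
```

Picking randomly among the top few options instead of always taking the maximum is a one-line change and makes the unit less predictable.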
If the terrain where the battle occurs is predetermined, or not too large, there is an article on terrain reasoning in FPS games that can be used as a basis for a turn-based game.
In short, you precalculate a set of values for each cell of the map, such as suitability for shooting in a given direction, protection, visibility, and so on. The AI can then use these values to choose a suitable action. For example, a fighter will walk as quickly as possible toward the enemy, using protection if available, while a thief will take a path where visibility from the enemy's direction is as low as possible, with the goal of attacking from the flank or rear.
If the terrain is randomized and/or too large, however, the precalculation can take too long to be useful.
regards
Guillaume
A good question; the answers can be all over the place. Personally, I don't have a lot of experience with this, but I would build the strategy around concepts, not distance.
You are going to create a state machine for each NPC. It will pick a character to attack based on some settings.
For example, an NPC would be flagged as Attack Weakest, Attack Strongest or Attack Most Injured. Then I would attempt to position them such that they can damage their desired target.
If you also have healers you can do the same thing in reverse for the healer target.
Target changing will be an important part of this system too. So you will want to think about that. A simple version is to reevaluate changing target a given percentage of the turns.
And finally, I would add random chance into the system. For example a character could be set as follows
Attack Weakest .25
Attack Strongest .50
Attack Most Injured .25
Change target .1
When it's time to attack, you generate a random number from 0 to 1. If it's under your Change target value, you change target by generating another random number to choose which attack mode picks the new target.
You can begin to factor distance into your system by adjusting the attack mode percentages.
For example, if it would take 3 turns to attack the most injured, decrease its percentage of being targeted by dividing that value by 3 and distributing the difference to the other two possibilities.
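The weighted choice and the distance adjustment can be sketched in a few lines of Python. The function names are hypothetical, and splitting the freed-up weight equally between the other modes is an assumption, since the description does not say how the difference should be distributed.

```python
import random

# Attack-mode weights and change-target chance from the example above.
MODE_WEIGHTS = {"weakest": 0.25, "strongest": 0.50, "most_injured": 0.25}
CHANGE_TARGET_CHANCE = 0.1


def adjust_for_distance(weights, mode, turns):
    """Divide one mode's weight by `turns` and share the remainder among the others."""
    adjusted = dict(weights)
    reduced = weights[mode] / turns
    freed = weights[mode] - reduced
    others = [m for m in weights if m != mode]
    adjusted[mode] = reduced
    for other in others:
        adjusted[other] += freed / len(others)  # equal split (assumption)
    return adjusted


def should_change_target(rng=random):
    """Roll against the change-target chance."""
    return rng.random() < CHANGE_TARGET_CHANCE


def pick_mode(weights, rng=random):
    """Draw one attack mode in proportion to its weight."""
    modes = list(weights)
    return rng.choices(modes, weights=[weights[m] for m in modes], k=1)[0]


# The most injured enemy is 3 turns away, so that mode becomes less likely.
weights = adjust_for_distance(MODE_WEIGHTS, "most_injured", turns=3)
print(weights)
if should_change_target():
    print("new targeting mode:", pick_mode(weights))
```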