I am currently learning reinforcement learning and have built a blackjack game.
There is an obvious reward at the end of the game (the payout); however, some actions that do not directly lead to rewards (e.g. hitting on a count of 5) should be encouraged, even if the end result is negative (losing the hand).
My question is: what should the reward be for those actions?
I could hard-code a positive reward (a fraction of the reward for winning the hand) for hits which do not lead to busting, but it feels like I am not approaching the problem correctly.
Also, when I assign a reward for a win (after the hand is over), I update the q-value corresponding to the last action/state pair, which seems suboptimal, as this action may not have directly led to the win.
Another option I thought of is to assign the same end reward to all of the action/state pairs in the sequence; however, some actions (like hitting on a count < 10) should be encouraged even if they lead to a lost hand.
Note: My end goal is to use deep-RL with an LSTM, but I am starting with q-learning.
I would say start simple and use the rewards the game dictates: if you win, you receive a reward of +1; if you lose, -1.
It seems you'd like to reward some actions based on human knowledge. Maybe start with epsilon-greedy exploration and let the agent discover all actions on its own. Experiment with the discount hyperparameter, which determines the importance of future rewards, and see whether the agent comes up with some interesting strategies.
This blog is about RL and Blackjack.
https://towardsdatascience.com/playing-blackjack-using-model-free-reinforcement-learning-in-google-colab-aa2041a2c13d
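To make that concrete, here is a minimal sketch of tabular Q-learning on a heavily simplified blackjack hand, using only the game's own terminal reward; the environment, state encoding and card rules below are my own simplifications rather than any particular library:

import random
from collections import defaultdict

ACTIONS = [0, 1]                      # 0 = stand, 1 = hit
alpha, gamma, epsilon = 0.1, 1.0, 0.1
Q = defaultdict(lambda: [0.0, 0.0])   # state -> one value per action

def draw_card():
    # simplified deck: face cards count as 10, aces always count as 1
    return min(random.randint(1, 13), 10)

def play_episode():
    player, dealer = draw_card() + draw_card(), draw_card()
    history = []                                  # (state, action) pairs for this hand
    while True:
        state = (player, dealer)
        if random.random() < epsilon:             # epsilon-greedy exploration
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[state][x])
        history.append((state, a))
        if a == 1:                                # hit
            player += draw_card()
            if player > 21:
                return history, -1                # bust: terminal reward -1
            # non-terminal step: reward 0, bootstrap from the best action in the next state
            next_state = (player, dealer)
            Q[state][a] += alpha * (0 + gamma * max(Q[next_state]) - Q[state][a])
        else:                                     # stand: dealer plays out, hand ends
            while dealer < 17:
                dealer += draw_card()
            reward = 1 if dealer > 21 or player > dealer else (-1 if dealer > player else 0)
            return history, reward

for _ in range(100000):
    history, reward = play_episode()
    s, a = history[-1]
    # only the final (state, action) pair sees the game's reward directly;
    # earlier hits pick it up over many episodes through the bootstrapped update above
    Q[s][a] += alpha * (reward - Q[s][a])

Notice there is no hand-crafted reward for "good" hits: their value emerges because they lead to states from which winning is more likely.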
I think the title says it. A "game" takes a number of moves to complete, at which point a total score is computed. The goal is to maximize this score, and there are no rewards provided for specific moves during the game. Is there an existing algorithm that is geared toward this type of problem?
EDIT: By "continuously variable" reward, I mean it is a floating point number, not a win/loss binary. So you can't, for example, respond to "winning" by reinforcing the moves made to get there. All you have is a number. You can rank different runs in order of preference, but a single result is not especially meaningful.
First of all, in my opinion, the title of your question seems a little confusing when you talk about "continuously variable reward". Maybe you could clarify this aspect.
On the other hand, leaving aside the previous point, it looks like you are talking about the temporal credit-assignment problem: how do you distribute credit over a sequence of actions that only obtains a reward (positive or negative) at the end of the sequence?
E.g., a tic-tac-toe game where the agent doesn't receive any reward until the game ends. In this case, almost any RL algorithm tries to solve the temporal credit-assignment problem. See, for example, Section 1.5 of Sutton and Barto's RL book, where they explain the working principles of RL and its advantages over other approaches using a tic-tac-toe game as an example.
I have read several materials about deep Q-learning and I'm not sure I understand it completely. From what I learned, it seems that deep Q-learning calculates Q-values faster than looking them up in a table, by using a neural network to perform a regression, calculating a loss and backpropagating the error to update the weights. Then, in a testing scenario, it takes a state and the network returns a Q-value for each action possible in that state, and the action with the highest Q-value is chosen for that state.
My only question is how the weights are updated. According to this site, the weights are updated as follows:
Δw = α · [R + γ · max_a' Q(s', a', w) − Q(s, a, w)] · ∇_w Q(s, a, w)
I understand that the weights are initialized randomly, R is returned by the environment, and gamma and alpha are set manually, but I don't understand how Q(s', a, w) and Q(s, a, w) are initialized and calculated. Should we build a table of Q-values and update it as in tabular Q-learning, or are these values computed automatically at each training epoch of the network? What am I not understanding here? Can somebody explain this equation better?
In Q-Learning, we are concerned with learning the Q(s, a) function, which maps a state to a value for each action. Say you have an arbitrary state space and an action space of 3 actions; each of these states will then map to three different values, one per action. In tabular Q-Learning, this is done with a physical table. Consider the following case:
Here, we have a Q table for each state in the game (upper left). And after each time step, the Q value for that specific action is updated according to some reward signal. The reward signal can be discounted by some value between 0 and 1.
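As a tiny sketch of what that table and its per-step update look like in code (the state/action counts and the transition values are arbitrary placeholders):

import numpy as np

n_states, n_actions = 10, 3                  # arbitrary sizes for illustration
Q = np.zeros((n_states, n_actions))          # the "physical table": one row per state, one column per action

alpha, gamma = 0.1, 0.99
s, a, r, s_next = 0, 2, 1.0, 4               # one observed transition (placeholder values)

# one-step Q-learning update: only the (s, a) cell changes
Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])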
In Deep Q-Learning, we disregard the use of tables and create a parametrized "table" such as this:
Here, all of the weights will form combinations, given the input, that should approximately match the value seen in the tabular case (this is still actively researched).
The equation you presented is the Q-learning update rule expressed as a gradient update rule.
alpha is the step-size
R is the reward
Gamma is the discounting factor
You do inference with the network to retrieve the value of the "discounted future state" and subtract the value of the "current" state from it. If this is unclear, I recommend you look up bootstrapping, which is basically what is happening here.
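Putting the pieces together, here is a minimal sketch of that gradient form of the update in PyTorch; the network size and the transition values are placeholders I chose, not something from the question:

import torch
import torch.nn as nn

# Q(s, a; w) as a small network that outputs one value per action
n_state_dims, n_actions = 4, 3
q_net = nn.Sequential(nn.Linear(n_state_dims, 32), nn.ReLU(), nn.Linear(32, n_actions))
optimizer = torch.optim.SGD(q_net.parameters(), lr=0.01)   # lr plays the role of alpha
gamma = 0.99

# one observed transition (placeholder values)
s = torch.rand(n_state_dims)
a = 1
r = 1.0
s_next = torch.rand(n_state_dims)

# bootstrapped target: R + gamma * max_a' Q(s', a'; w), treated as a constant (no gradient)
with torch.no_grad():
    target = r + gamma * q_net(s_next).max()

# squared TD error between the target and the current estimate Q(s, a; w)
loss = (target - q_net(s)[a]) ** 2

# backprop moves the weights so that Q(s, a; w) approaches the target,
# which is exactly the Q-learning update written as a gradient step
optimizer.zero_grad()
loss.backward()
optimizer.step()

So there is no separate table to maintain: both Q(s, a, w) and Q(s', a', w) are simply forward passes through the current network, and the weights are what gets updated.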
I have written a system that summarizes a long document containing thousands of words. Are there any norms on how such a system should be evaluated in the context of a user survey?
In short, is there a metric for evaluating the time that my tool has saved a human? Currently, I was thinking of using the (Time taken to read the original document/Time taken to read the summary) as a way of determining the time saved, but are there better metrics?
Currently, I am asking the user subjective questions about the accuracy of the summary.
In general:
Bleu measures precision: how much the words (and/or n-grams) in the machine generated summaries appeared in the human reference summaries.
Rouge measures recall: how much the words (and/or n-grams) in the human reference summaries appeared in the machine generated summaries.
Naturally - these results are complementing, as is often the case in precision vs recall. If you have many words/ngrams from the system results appearing in the human references you will have high Bleu, and if you have many words/ngrams from the human references appearing in the system results you will have high Rouge.
There's something called the brevity penalty, which is quite important and has already been added to standard Bleu implementations. It penalizes system results which are shorter than the general length of a reference (read more about it here). This complements the n-gram metric's behavior, which in effect already penalizes results that are longer than the reference, since the denominator grows the longer the system result is.
You could also implement something similar for Rouge, this time penalizing system results which are longer than the general reference length. In Rouge we divide by the length of the human references, so longer system results could otherwise obtain artificially higher Rouge scores (the longer the result, the higher the chance it hits some word appearing in the references); an additional length penalty would counteract that.
Finally, you could use the F1 measure to make the metrics work together: F1 = 2 * (Bleu * Rouge) / (Bleu + Rouge)
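As a small sketch tying together the brevity penalty and the F1 combination above (the function names and input numbers are mine):

import math

def brevity_penalty(candidate_len, reference_len):
    # standard BLEU brevity penalty: no penalty if the candidate is at least as long
    # as the reference, otherwise an exponential penalty for being too short
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)

def bleu_rouge_f1(bleu, rouge):
    # harmonic mean of the two scores, as in the formula above
    if bleu + rouge == 0:
        return 0.0
    return 2 * bleu * rouge / (bleu + rouge)

print(brevity_penalty(8, 12))      # a short candidate gets penalized
print(bleu_rouge_f1(0.4, 0.6))     # 0.48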
BLEU
Bleu measures precision
Bilingual Evaluation Understudy
Originally for machine translation (hence "Bilingual")
Words from the machine-generated summary found in the human reference summary
That is, how much of the words (and/or n-grams) in the machine-generated summaries appeared in the human reference summaries
The closer a machine translation is to a professional human translation, the better it is
ROUGE
Rouge measures recall
Recall Oriented Understudy for Gisting Evaluation
Words from the human reference summary found in the machine-generated summary
That is, how much of the words (and/or n-grams) in the human reference summaries appeared in the machine-generated summaries.
Overlap of n-grams between the system and reference summaries.
ROUGE-N, where N is the n-gram length
reference_text = """Artificial intelligence (AI, also machine intelligence, MI) is intelligence demonstrated by machines, in contrast to the natural intelligence (NI) displayed by humans and other animals. In computer science AI research is defined as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving". See glossary of artificial intelligence. The scope of AI is disputed: as machines become increasingly capable, tasks considered as requiring "intelligence" are often removed from the definition, a phenomenon known as the AI effect, leading to the quip "AI is whatever hasn't been done yet." For instance, optical character recognition is frequently excluded from "artificial intelligence", having become a routine technology. Capabilities generally classified as AI as of 2017 include successfully understanding human speech, competing at a high level in strategic game systems (such as chess and Go), autonomous cars, intelligent routing in content delivery networks, military simulations, and interpreting complex data, including images and videos. Artificial intelligence was founded as an academic discipline in 1956, and in the years since has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success and renewed funding. For most of its history, AI research has been divided into subfields that often fail to communicate with each other. These sub-fields are based on technical considerations, such as particular goals (e.g. "robotics" or "machine learning"), the use of particular tools ("logic" or "neural networks"), or deep philosophical differences. Subfields have also been based on social factors (particular institutions or the work of particular researchers). The traditional problems (or goals) of AI research include reasoning, knowledge, planning, learning, natural language processing, perception and the ability to move and manipulate objects. General intelligence is among the field's long-term goals. Approaches include statistical methods, computational intelligence, and traditional symbolic AI. Many tools are used in AI, including versions of search and mathematical optimization, neural networks and methods based on statistics, probability and economics. The AI field draws upon computer science, mathematics, psychology, linguistics, philosophy and many others. The field was founded on the claim that human intelligence "can be so precisely described that a machine can be made to simulate it". This raises philosophical arguments about the nature of the mind and the ethics of creating artificial beings endowed with human-like intelligence, issues which have been explored by myth, fiction and philosophy since antiquity. Some people also consider AI to be a danger to humanity if it progresses unabatedly. Others believe that AI, unlike previous technological revolutions, will create a risk of mass unemployment. 
In the twenty-first century, AI techniques have experienced a resurgence following concurrent advances in computer power, large amounts of data, and theoretical understanding; and AI techniques have become an essential part of the technology industry, helping to solve many challenging problems in computer science."""
Abstractive summarization
# Abstractive Summarize
len(reference_text.split())
from transformers import pipeline
summarization = pipeline("summarization")
abstractive_summarization = summarization(reference_text)[0]["summary_text"]
Abstractive Output
In computer science AI research is defined as the study of "intelligent agents" Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving" Capabilities generally classified as AI as of 2017 include successfully understanding human speech, competing at a high level in strategic game systems (such as chess and Go)
Extractive summarization
# Extractive summarize
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
parser = PlaintextParser.from_string(reference_text, Tokenizer("english"))
# parser.document.sentences
summarizer = LexRankSummarizer()
extractive_summarization = summarizer(parser.document, 2)
extractive_summarization = ' '.join([str(s) for s in extractive_summarization])
Extractive Output
Colloquially, the term "artificial intelligence" is often used to describe machines that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving". As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect. Sub-fields have also been based on social factors (particular institutions or the work of particular researchers). The traditional problems (or goals) of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception and the ability to move and manipulate objects.
Using Rouge to evaluate the abstractive summary
from rouge import Rouge
r = Rouge()
r.get_scores(abstractive_summarization, reference_text)
Using Rouge Abstractive summary output
[{'rouge-1': {'f': 0.22299651364421083,
'p': 0.9696969696969697,
'r': 0.12598425196850394},
'rouge-2': {'f': 0.21328671127225052,
'p': 0.9384615384615385,
'r': 0.1203155818540434},
'rouge-l': {'f': 0.29041095634452996,
'p': 0.9636363636363636,
'r': 0.17096774193548386}}]
Using Rouge to evaluate the extractive summary
from rouge import Rouge
r = Rouge()
r.get_scores(extractive_summarization, reference_text)
Using Rouge Extractive summary output
[{'rouge-1': {'f': 0.27860696251962963,
'p': 0.8842105263157894,
'r': 0.16535433070866143},
'rouge-2': {'f': 0.22296172781038814,
'p': 0.7127659574468085,
'r': 0.13214990138067062},
'rouge-l': {'f': 0.354755780824869,
'p': 0.8734177215189873,
'r': 0.22258064516129034}}]
Interpreting rouge scores
ROUGE is a score of overlapping words. ROUGE-N refers to overlapping n-grams. Specifically:
ROUGE-N = (Σ_r Σ_s Count_match(gram_n)) / (Σ_r Σ_s Count(gram_n))
I tried to simplify the notation compared with the original paper. Let's assume we are calculating ROUGE-2, i.e. bigram matches. The numerator Σ_s loops through all bigrams in a single reference summary and counts the number of times a matching bigram is found in the candidate summary (the one proposed by the summarization algorithm). If there is more than one reference summary, Σ_r repeats the process over all reference summaries.
The denominator simply counts the total number of bigrams in all reference summaries. This is the process for one document-summary pair. You repeat the process for all documents, and average all the scores and that gives you a ROUGE-N score. So a higher score would mean that on average there is a high overlap of n-grams between your summaries and the references.
Example:
S1. police killed the gunman
S2. police kill the gunman
S3. the gunman kill police
S1 is the reference and S2 and S3 are candidates. Note S2 and S3 both have one overlapping bigram with the reference, so they have the same ROUGE-2 score, although S2 should be better. An additional ROUGE-L score deals with this, where L stands for Longest Common Subsequence. In S2, the first word and last two words match the reference, so it scores 3/4, whereas S3 only matches the bigram, so scores 2/4.
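A small sketch that reproduces this example by hand (the helper functions are mine, not the official ROUGE toolkit):

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(reference, candidate, n=2):
    # count of reference n-grams that also appear in the candidate, over all reference n-grams
    ref, cand = ngrams(reference.split(), n), ngrams(candidate.split(), n)
    matches = sum(1 for g in ref if g in cand)
    return matches / len(ref)

def lcs_len(a, b):
    # longest common subsequence length, used by ROUGE-L
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            table[i + 1][j + 1] = table[i][j] + 1 if x == y else max(table[i][j + 1], table[i + 1][j])
    return table[-1][-1]

s1 = "police killed the gunman"           # reference
s2 = "police kill the gunman"             # candidate A
s3 = "the gunman kill police"             # candidate B

print(rouge_n(s1, s2), rouge_n(s1, s3))                       # 1/3 and 1/3: the same ROUGE-2
print(lcs_len(s1.split(), s2.split()) / 4,                    # 3/4 for S2
      lcs_len(s1.split(), s3.split()) / 4)                    # 2/4 for S3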
Historically, summarization systems have often been evaluated by comparing to human-generated reference summaries. In some cases, the human summarizer constructs a summary by selecting relevant sentences from the original document; in others, the summaries are hand-written from scratch.
Those two techniques are analogous to the two major categories of automatic summarization systems - extractive vs. abstractive (more details available on Wikipedia).
One standard tool is Rouge, a script (or a set of scripts; I can't remember offhand) that computes n-gram overlap between the automatic summary and a reference summary. Rouge can optionally compute overlap allowing word insertions or deletions between the two summaries (e.g. if allowing a 2-word skip, 'installed pumps' would be credited as a match to 'installed defective flood-control pumps').
My understanding is that Rouge's n-gram overlap scores were fairly well correlated with human evaluation of summaries up to some level of accuracy, but that the relationship may break down as summarization quality improves. I.e., that beyond some quality threshold, summaries that are judged better by human evaluators may be scored similarly to - or outscored by - summaries judged inferior. Nevertheless, Rouge scores might be a helpful first cut at comparing 2 candidate summarization systems, or a way to automate regression testing and weed out serious regressions before passing a system on to human evaluators.
Your approach of collecting human judgements is probably the best evaluation, if you're able to afford the time / monetary cost. To add a little rigor to that process, you might look at the scoring criteria used in recent summarization tasks (see the various conferences mentioned by @John Lehmann). The scoresheets used by those evaluators might help guide your own evaluation.
I'm not sure about the time evaluation, but regarding accuracy you might consult literature under the topic Automatic Document Summarization. The primary evaluation was the Document Understanding Conference (DUC) until the Summarization task was moved into Text Analysis Conference (TAC) in 2008. Most of these focus on advanced summarization topics such as multi-document, multi-lingual, and update summaries.
You can find the evaluation guidelines for each of these events posted online. For single document summarization tasks look at DUC 2002-2004.
Or, you might consult the ADS evaluation section in Wikipedia.
There is also the very recent BERTScore metric (arXiv'19, ICLR'20, already almost 90 citations) that does not suffer from the well-known issues of ROUGE and BLEU.
Abstract from the paper:
We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTScore is more robust to challenging examples when compared to existing metrics.
Paper: https://arxiv.org/pdf/1904.09675.pdf
Code: https://github.com/Tiiiger/bert_score
Full reference:
Zhang, Tianyi, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. "Bertscore: Evaluating text generation with bert." arXiv preprint arXiv:1904.09675 (2019).
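For reference, a minimal usage sketch with the authors' bert-score package (installed via pip install bert-score; the example sentences are mine):

from bert_score import score

candidates = ["the gunman was killed by police"]
references = ["police killed the gunman"]

# P, R, F1 are tensors with one entry per candidate/reference pair
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.4f}")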
There are many parameters against which you can evaluate your summarization system, such as the following (a small sketch of computing them appears after the list):
Precision = Number of important sentences in the summary / Total number of sentences in the summary.
Recall = Number of important sentences in the summary / Total number of important sentences in the document.
F Score = 2 * (Precision * Recall) / (Precision + Recall)
Compression Rate = Total number of words in the summary / Total number of words in the original document.
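A small sketch of computing these (the sentence sets and word counts are placeholder values; the "important" sentences would come from a human annotator):

summary_sentences = {"s1", "s2", "s3"}          # sentences the system put in the summary
important_sentences = {"s1", "s2", "s4", "s5"}  # sentences a human judged important

retrieved_important = summary_sentences & important_sentences

precision = len(retrieved_important) / len(summary_sentences)
recall = len(retrieved_important) / len(important_sentences)
f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

summary_words, document_words = 60, 400         # placeholder word counts
compression_rate = summary_words / document_words

print(precision, recall, f_score, compression_rate)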
When you are evaluating an automatic summarisation system you would typically look at the content of the summary rather than time.
Your idea of:
(Time taken to read the original document/Time taken to read the summary)
Doesn't tell you much about your summarisation system; it really only gives you an idea of the compression rate of your system (i.e. the summary is 10% of the original document).
You may want to consider the time it takes your system to summarise a document vs. the time it would take a human (system: 2s, human: 10 mins).
I recommend BARTScore. Check the GitHub page and the article. The authors also released a meta-evaluation on the ExplainaBoard platform, "which allows to interactively understand the strengths, weaknesses, and complementarity of each metric". You can find a list of most of the state-of-the-art metrics there.
As a quick summary of a collection of metrics, I wrote a post describing the evaluation metrics: what kinds of metrics exist, how they differ from human evaluation, etc. You can read the blog post Evaluation Metrics: Assessing the quality of NLG outputs.
Also, along with our NLP projects, we created and publicly released an evaluation package, Jury, which is still actively maintained; you can see the reasons why we created such a package in the repo. There are several packages to carry out evaluation in NLP (some of them specialized in a specific NLP task):
jury
datasets
sacrebleu (Machine Translation)
SummEval (Summarization)