How do I evaluate a text summarization tool? - language-agnostic

I have written a system that summarizes a long document containing thousands of words. Are there any norms on how such a system should be evaluated in the context of a user survey?
In short, is there a metric for evaluating the time that my tool has saved a human? Currently, I was thinking of using the (Time taken to read the original document/Time taken to read the summary) as a way of determining the time saved, but are there better metrics?
Currently, I am asking the user subjective questions about the accuracy of the summary.

In general:
Bleu measures precision: how much the words (and/or n-grams) in the machine generated summaries appeared in the human reference summaries.
Rouge measures recall: how much the words (and/or n-grams) in the human reference summaries appeared in the machine generated summaries.
Naturally - these results are complementing, as is often the case in precision vs recall. If you have many words/ngrams from the system results appearing in the human references you will have high Bleu, and if you have many words/ngrams from the human references appearing in the system results you will have high Rouge.
There's something called brevity penalty, which is quite important and has already been added to standard Bleu implementations. It penalizes system results which are shorter than the general length of a reference (read more about it here). This complements the n-gram metric behavior which in effect penalizes longer than reference results, since the denominator grows the longer the system result is.
You could also implement something similar for Rouge, but this time penalizing system results which are longer than the general reference length, which would otherwise enable them to obtain artificially higher Rouge scores (since the longer the result, the higher the chance you would hit some word appearing in the references). In Rouge we divide by the length of the human references, so we would need an additional penalty for longer system results which could artificially raise their Rouge score.
Finally, you could use the F1 measure to make the metrics work together: F1 = 2 * (Bleu * Rouge) / (Bleu + Rouge)

BLEU
Bleu measures precision
Bilingual Evaluation Understudy
Originally for machine translation(Bilingual)
W(machine generates summary) in (Human reference Summary)
That is how much the word (and/or n-grams) in the machine generated summaries appeared in the human reference summaries
The closer a machine translation is to a professional human translation, the better it is
ROUGE
Rouge measures recall
Recall Oriented Understudy for Gisting Evaluation
-W(Human Reference Summary) In w(machine generates summary)
That is how much the words (and/or n-grams) in the machine generates summaries appeared in the machine generated summaries.
Overlap of N-grams between the system and references summaries.
-Rouge N, ehere N is n-gram
reference_text = """Artificial intelligence (AI, also machine intelligence, MI) is intelligence demonstrated by machines, in contrast to the natural intelligence (NI) displayed by humans and other animals. In computer science AI research is defined as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving". See glossary of artificial intelligence. The scope of AI is disputed: as machines become increasingly capable, tasks considered as requiring "intelligence" are often removed from the definition, a phenomenon known as the AI effect, leading to the quip "AI is whatever hasn't been done yet." For instance, optical character recognition is frequently excluded from "artificial intelligence", having become a routine technology. Capabilities generally classified as AI as of 2017 include successfully understanding human speech, competing at a high level in strategic game systems (such as chess and Go), autonomous cars, intelligent routing in content delivery networks, military simulations, and interpreting complex data, including images and videos. Artificial intelligence was founded as an academic discipline in 1956, and in the years since has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success and renewed funding. For most of its history, AI research has been divided into subfields that often fail to communicate with each other. These sub-fields are based on technical considerations, such as particular goals (e.g. "robotics" or "machine learning"), the use of particular tools ("logic" or "neural networks"), or deep philosophical differences. Subfields have also been based on social factors (particular institutions or the work of particular researchers). The traditional problems (or goals) of AI research include reasoning, knowledge, planning, learning, natural language processing, perception and the ability to move and manipulate objects. General intelligence is among the field's long-term goals. Approaches include statistical methods, computational intelligence, and traditional symbolic AI. Many tools are used in AI, including versions of search and mathematical optimization, neural networks and methods based on statistics, probability and economics. The AI field draws upon computer science, mathematics, psychology, linguistics, philosophy and many others. The field was founded on the claim that human intelligence "can be so precisely described that a machine can be made to simulate it". This raises philosophical arguments about the nature of the mind and the ethics of creating artificial beings endowed with human-like intelligence, issues which have been explored by myth, fiction and philosophy since antiquity. Some people also consider AI to be a danger to humanity if it progresses unabatedly. Others believe that AI, unlike previous technological revolutions, will create a risk of mass unemployment. In the twenty-first century, AI techniques have experienced a resurgence following concurrent advances in computer power, large amounts of data, and theoretical understanding; and AI techniques have become an essential part of the technology industry, helping to solve many challenging problems in computer science."""
Abstractive summarization
# Abstractive Summarize
len(reference_text.split())
from transformers import pipeline
summarization = pipeline("summarization")
abstractve_summarization = summarization(reference_text)[0]["summary_text"]
Abstractive Output
In computer science AI research is defined as the study of "intelligent agents" Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving" Capabilities generally classified as AI as of 2017 include successfully understanding human speech, competing at a high level in strategic game systems (such as chess and Go)
EXtractive summarization
# Extractive summarize
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
parser = PlaintextParser.from_string(reference_text, Tokenizer("english"))
# parser.document.sentences
summarizer = LexRankSummarizer()
extractve_summarization = summarizer(parser.document,2)
extractve_summarization) = ' '.join([str(s) for s in list(extractve_summarization)])
Extractive Output
Colloquially, the term "artificial intelligence" is often used to describe machines that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving".As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect. Sub-fields have also been based on social factors (particular institutions or the work of particular researchers).The traditional problems (or goals) of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception and the ability to move and manipulate objects.
Using Rouge to Evaluate abstractive Summary
from rouge import Rouge
r = Rouge()
r.get_scores(abstractve_summarization, reference_text)
Using Rouge Abstractive summary output
[{'rouge-1': {'f': 0.22299651364421083,
'p': 0.9696969696969697,
'r': 0.12598425196850394},
'rouge-2': {'f': 0.21328671127225052,
'p': 0.9384615384615385,
'r': 0.1203155818540434},
'rouge-l': {'f': 0.29041095634452996,
'p': 0.9636363636363636,
'r': 0.17096774193548386}}]
Using Rouge to Evaluate abstractive Summary
from rouge import Rouge
r = Rouge()
r.get_scores(extractve_summarization, reference_text)
Using Rouge Extractive summary output
[{'rouge-1': {'f': 0.27860696251962963,
'p': 0.8842105263157894,
'r': 0.16535433070866143},
'rouge-2': {'f': 0.22296172781038814,
'p': 0.7127659574468085,
'r': 0.13214990138067062},
'rouge-l': {'f': 0.354755780824869,
'p': 0.8734177215189873,
'r': 0.22258064516129034}}]
Interpreting rouge scores
ROUGE is a score of overlapping words. ROUGE-N refers to overlapping n-grams. Specifically:
I tried to simplify the notation when compared with the original paper. Let's assume we are calculating ROUGE-2, aka bigram matches. The numerator ∑s loops through all bigrams in a single reference summary and calculates the number of times a matching bigram is found in the candidate summary (proposed by the summarization algorithm). If there are more than one reference summary, ∑r ensures we repeat the process over all reference summaries.
The denominator simply counts the total number of bigrams in all reference summaries. This is the process for one document-summary pair. You repeat the process for all documents, and average all the scores and that gives you a ROUGE-N score. So a higher score would mean that on average there is a high overlap of n-grams between your summaries and the references.
Example:
S1. police killed the gunman
S2. police kill the gunman
S3. the gunman kill police
S1 is the reference and S2 and S3 are candidates. Note S2 and S3 both have one overlapping bigram with the reference, so they have the same ROUGE-2 score, although S2 should be better. An additional ROUGE-L score deals with this, where L stands for Longest Common Subsequence. In S2, the first word and last two words match the reference, so it scores 3/4, whereas S3 only matches the bigram, so scores 2/4.

Historically, summarization systems have often been evaluated by comparing to human-generated reference summaries. In some cases, the human summarizer constructs a summary by selecting relevant sentences from the original document; in others, the summaries are hand-written from scratch.
Those two techniques are analogous to the two major categories of automatic summarization systems - extractive vs. abstractive (more details available on Wikipedia).
One standard tool is Rouge, a script (or a set of scripts; I can't remember offhand) that computes n-gram overlap between the automatic summary and a reference summary. Rough can optionally compute overlap allowing word insertions or deletions between the two summaries (e.g. if allowing a 2-word skip, 'installed pumps' would be credited as a match to 'installed defective flood-control pumps').
My understanding is that Rouge's n-gram overlap scores were fairly well correlated with human evaluation of summaries up to some level of accuracy, but that the relationship may break down as summarization quality improves. I.e., that beyond some quality threshold, summaries that are judged better by human evaluators may be scored similarly to - or outscored by - summaries judged inferior. Nevertheless, Rouge scores might be a helpful first cut at comparing 2 candidate summarization systems, or a way to automate regression testing and weed out serious regressions before passing a system on to human evaluators.
Your approach of collecting human judgements is probably the best evaluation, if you're able to afford the time / monetary cost. To add a little rigor to that process, you might look at the scoring criteria used in recent summarization tasks (see the various conferences mentioned by #John Lehmann). The scoresheets used by those evaluators might help guide your own evaluation.

I'm not sure about the time evaluation, but regarding accuracy you might consult literature under the topic Automatic Document Summarization. The primary evaluation was the Document Understanding Conference (DUC) until the Summarization task was moved into Text Analysis Conference (TAC) in 2008. Most of these focus on advanced summarization topics such as multi-document, multi-lingual, and update summaries.
You can find the evaluation guidelines for each of these events posted online. For single document summarization tasks look at DUC 2002-2004.
Or, you might consult the ADS evaluation section in Wikipedia.

There is also the very recent BERTScore metric (arXiv'19, ICLR'20, already almost 90 citations) that does not suffer from the well-known issues of ROUGE and BLEU.
Abstract from the paper:
We propose BERTScore, an automatic evaluation metric for text
generation. Analogously to common metrics, BERTScore computes a
similarity score for each token in the candidate sentence with each
token in the reference sentence. However, instead of exact matches, we
compute token similarity using contextual embeddings. We evaluate
using the outputs of 363 machine translation and image captioning
systems. BERTScore correlates better with human judgments and provides
stronger model selection performance than existing metrics. Finally,
we use an adversarial paraphrase detection task to show that BERTScore
is more robust to challenging examples when compared to existing
metrics.
Paper: https://arxiv.org/pdf/1904.09675.pdf
Code: https://github.com/Tiiiger/bert_score
Full reference:
Zhang, Tianyi, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. "Bertscore: Evaluating text generation with bert." arXiv preprint arXiv:1904.09675 (2019).

There are many parameters against which you can evaluate your summarization system.
like
Precision = Number of important sentences/Total number of sentences summarized.
Recall = Total number of important sentences Retrieved / Total number of important sentences present.
F Score = 2*(Precision*Recall/Precision+ Recall)
Compressed Rate = Total number of words in the summary / Total number of words in original document.

When you are evaluating an automatic summarisation system you would typically look at the content of the summary rather than time.
Your idea of:
(Time taken to read the original document/Time taken to read the summary)
Doesn't tell you much about your summarisation system, it really only gives you and idea of the compression rate of your system (i.e. the summary is 10% of the original document).
You may want to consider the time it takes your system to summarise a document vs. the time it would take a human (system: 2s, human: 10 mins).

I recommend BartScore. Check the Github page and the article. The authors issued also a meta-evaluation on the ExplainaBoard platform, "which allows to interactively understand the strengths, weaknesses, and complementarity of each metric". You can find the list of most of the state-of-the-art metrics there.

As a quick summary for collection of metrics, I wrote a post descrbing the evaluation metrics, what kind of metrics do we have ? what's the difference between human evaluation ? etc. You can read the blog post Evaluation Metrics: Assessing the quality of NLG outputs.
Also, along with the NLP projects we created and publicly released an evaluation package Jury which is still actively maintained and you can see the reasons why we created such a package in the repo. There are packages to carry out evaluation in NLP (some of them are specialized in a spesific NLP task):
jury
datasets
sacrebleu (Machine Translation)
SummEval (Summarization)

Related

How to choose the best LDA model when coherence and perplexy show opposed trends?

I have a corpus with around 1,500,000 documents of titles and abstracts from scientific research projects within STEM. I used Mallet https://mimno.github.io/Mallet/transforms to fit models from 10 to 790 topics in 10 topics increments (I allow for hyperparameter optimization). I ran three replicates with differing seeds using 70% of the corpus for training and 30% for validation. The corpus was pre-processed to remove stop-words, lemmatise, etc.
I want to choose the model with best number of topics. I understand that coherence and perplexity are two quantitative measures of model performance that should allow me to do this. They measure two 'separate' types of performance. Coherence measures the ability of the model finding topics that are the closest to human interpretability. While perplexity measures the ability of the model to classify unseen data with a trained model.
However, when I calculate these two methods they show opposing results with no clear optimum in either.
I was hoping that both would show some aggrement in trend, so i could choose what number of topics to work with (see figure below).

Is the reward related to previous state or next state?

In the reinforcement learning framework, I am a little bit confused about the reward and how it is related to states. For example, in Q-learning, we have the following formula for updating the Q table:
that means that the reward is obtained from the environment at the time t+1. I mean that after applying the action at, the environment gives st+1 and rt+1.
It is often true that the reward is associated with the previous time step, that is using rt in the above formula. See, for example the Wikipedia page for Q-learning (https://en.wikipedia.org/wiki/Q-learning). Why is this?
Accidentally, some Wikipedia pages about the same topic but in different languages, use rt+1 (or unexpectedly Rt+1). See, for example, the Italian and Japanese pages:
https://it.wikipedia.org/wiki/Q-learning
https://ja.wikipedia.org/wiki/Q%E5%AD%A6%E7%BF%92

LDA topic modeling - Training and testing

I have read LDA and I understand the mathematics of how the topics are generated when one inputs a collection of documents.
References say that LDA is an algorithm which, given a collection of documents and nothing more (no supervision needed), can uncover the “topics” expressed by documents in that collection. Thus by using LDA algorithm and the Gibbs Sampler (or Variational Bayes), I can input a set of documents and as output I can get the topics. Each topic is a set of terms with assigned probabilities.
What I don't understand is, if the above is true, then why do many topic modeling tutorials talk about separating the dataset into training and test set?
Can anyone explain me the steps (the basic concept) of how LDA can be used for training a model, which can then be used to analyze another test dataset?
Splitting the data into training and testing sets is a common step in evaluating the performance of a learning algorithm. It's more clear-cut for supervised learning, wherein you train the model on the training set, then see how well its classifications on the test set match the true class labels. For unsupervised learning, such evaluation is a little trickier. In the case of topic modeling, a common measure of performance is perplexity. You train the model (like LDA) on the training set, and then you see how "perplexed" the model is on the testing set. More specifically, you measure how well the word counts of the test documents are represented by the word distributions represented by the topics.
Perplexity is good for relative comparisons between models or parameter settings, but it's numeric value doesn't really mean much. I prefer to evaluate topic models using the following, somewhat manual, evaluation process:
Inspect the topics: Look at the highest-likelihood words in each topic. Do they sound like they form a cohesive "topic" or just some random group of words?
Inspect the topic assignments: Hold out a few random documents from training and see what topics LDA assigns to them. Manually inspect the documents and the top words in the assigned topics. Does it look like the topics really describe what the documents are actually talking about?
I realize that this process isn't as nice and quantitative as one might like, but to be honest, the applications of topic models are rarely quantitative either. I suggest evaluating your topic model according to the problem you're applying it to.
Good luck!
The general rule that using the training data for evaluation might be subject to overfitting also applies to unsupervised learning like LDA -- though it is not as obvious. LDA optimizes some objective, ie. generative probability, on the training data. It might be that in the training data two words are indicative of a topic, say "white house" for US politics. Assume the two words only occur once (in the training data). Then any algorithm fully relying on the assumption that they indicate only politics and nothing else would be doing great if you evaluated on the training data. However, if there are other topics like "architecture" then you might question, whether this is really the right thing to learn. Having a test data set can solve this issue to some extend:
Since the relationship "white house" seems rare in the training data, it likely does not occur at all in the test data. If so, the evaluation shows how much your model relies on spurious relationships that might in fact not be helpful compared to more general ones.
"White house" occurs in the test data, say it occurs once for "US politics" and once in a document on architecture. Then the assumption that it only indicates "US politics" is too strong and performance metrics will be worse, showing that your model is overfitting.

What are the differences between genetic algorithms and genetic programming?

I would like to have a simple explanation of the differences between genetic algorithms and genetic programming (without too much programming jargon). Examples would also be appreciated.
Apparently, in genetic programming, solutions are computer programs. On the other hand, genetic algorithms represent a solution as a string of numbers. Any other differences?
Genetic algorithms (GA) are search algorithms that mimic the process of natural evolution, where each individual is a candidate solution: individuals are generally "raw data" (in whatever encoding format has been defined).
Genetic programming (GP) is considered a special case of GA, where each individual is a computer program (not just "raw data"). GP explore the algorithmic search space and evolve computer programs to perform a defined task.
Genetic programming and genetic algorithms are very similar. They are both used to evolve the answer to a problem, by comparing the fitness of each candidate in a population of potential candidates over many generations.
Each generation, new candidates are found by randomly changing (mutation) or swapping parts (crossover) of other candidates. The least 'fit' candidates are removed from the population.
Structural differences
The main difference between them is the representation of the algorithm/program.
A genetic algorithm is represented as a list of actions and values, often a string. for example:
1+x*3-5*6
A parser has to be written for this encoding, to understand how to turn this into a function. The resulting function might look like this:
function(x) { return 1 * x * 3 - 5 * 6; }
The parser also needs to know how to deal with invalid states, because mutation and crossover operations don't care about the semantics of the algorithm, for example the following string could be produced: 1+/3-2*. An approach needs to be decided to deal with these invalid states.
A genetic program is represented as a tree structure of actions and values, usually a nested data structure. Here's the same example, illustrated as a tree:
-
/ \
* *
/ \ / \
1 * 5 6
/ \
x 3
A parser also has to be written for this encoding, but genetic programming does not (usually) produce invalid states because mutation and crossover operations work within the structure of the tree.
Practical differences
Genetic algorithms
Inherently have a fixed length, meaning the resulting function has bounded complexity
Often produces invalid states, so these need to be handled non-destructively
Often rely on operator precedence (e.g. in our example multiplication happens before subtraction) which could be seen as a limitation
Genetic programs
Inherently have a variable length, meaning they are more flexible, but often grow in complexity
Rarely produces invalid states, these can usually be discarded
Use an explicit structure to avoid operator precedence entirely
To make it simple, (on the way I see it) Genetic Programming is an application of Genetic Algorithm. The Genetic Algorithm is used to create another solution via a computer program.
Practical answer:
GA is when using a population and evolve the generations of population to a better state.
(For example how the humans have evolved from animals to people, by breading and get better genes)
GP is when by known definition of the problem generate code into better solve a problem.
(GP will usually give a lots of if/else statements, that will explain the solution)
Lots of good partial answers above. As Koza put it in his seminal texts on the subject, "[if a GA was the best solution for a problem then a GP would evolve a GA to solve it]." Simply put, a GP is a type of GA that evolves programs that are evaluated by a cost function. The fact that the genome is a program rather than a collection of inputs for the cost function IMHO is the material difference.
https://en.wikipedia.org/wiki/Genetic_programming
genetic programming is much more powerful than genetic algorithms. the output of the genetic algorithms is a quantity while the output of the genetic programming is another computer program.

How many units should there be in each generation of a genetic algorithm?

I am working on a roguelike and am using a GA to generate levels. My question is, how many levels should be in each generation of my GA? And, how many generations should it have? It it better to have a few levels in each generation, with many generations, or the other way around?
There really isn't a hard and fast rule for this type of thing - most experiments like to use at least 200 members in a population at the barest minimum, scaling up to millions or more. The number of generations is usually in the 100 to 10,000 range. In general, to answer your final question, it's better to have lots of members in the population so that "late-bloomer" genes stay in a population long enough to mature, and then use a smaller number of generations.
But really, these aren't the important thing. The most critical part of any GA is the fitness function. If you don't have a decent fitness function that accurately evaluates what you consider to be a "good" level or a "bad" level, you're not going to end up with interesting results no matter how many generations you use, or how big your population is :)
Just as Mike said, you need to try different numbers. If you have a large population, you need to make sure to have a good selection function. With a large population, it is very easy to cause the GA to converge to a "not so good" answer early on.