LDA topic modeling - Training and testing

I have read LDA and I understand the mathematics of how the topics are generated when one inputs a collection of documents.
References say that LDA is an algorithm which, given a collection of documents and nothing more (no supervision needed), can uncover the “topics” expressed by the documents in that collection. Thus, by using the LDA algorithm and a Gibbs sampler (or variational Bayes), I can input a set of documents and get the topics as output; each topic is a set of terms with assigned probabilities.
What I don't understand is: if the above is true, then why do many topic modeling tutorials talk about separating the dataset into training and test sets?
Can anyone explain to me the steps (the basic concept) of how LDA can be used to train a model, which can then be used to analyze a separate test dataset?

Splitting the data into training and testing sets is a common step in evaluating the performance of a learning algorithm. It's more clear-cut for supervised learning, wherein you train the model on the training set, then see how well its classifications on the test set match the true class labels. For unsupervised learning, such evaluation is a little trickier. In the case of topic modeling, a common measure of performance is perplexity. You train the model (like LDA) on the training set, and then you see how "perplexed" the model is on the testing set. More specifically, you measure how well the word counts of the test documents are represented by the word distributions of the topics.
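A minimal sketch of that train/test perplexity check, assuming gensim (which this answer doesn't prescribe; "texts" is a placeholder for your own tokenized documents):

# Train LDA on a training split, then score perplexity on the held-out split (gensim assumed)
from gensim.corpora import Dictionary
from gensim.models import LdaModel

train_texts, test_texts = texts[:800], texts[800:]          # texts: list of token lists
dictionary = Dictionary(train_texts)
train_bow = [dictionary.doc2bow(t) for t in train_texts]
test_bow = [dictionary.doc2bow(t) for t in test_texts]

lda = LdaModel(train_bow, id2word=dictionary, num_topics=20, passes=10)
bound = lda.log_perplexity(test_bow)                         # per-word likelihood bound
print("held-out perplexity:", 2 ** (-bound))                 # lower = less "perplexed"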
Perplexity is good for relative comparisons between models or parameter settings, but its numeric value doesn't really mean much. I prefer to evaluate topic models using the following, somewhat manual, evaluation process:
Inspect the topics: Look at the highest-likelihood words in each topic. Do they sound like they form a cohesive "topic" or just some random group of words?
Inspect the topic assignments: Hold out a few random documents from training and see what topics LDA assigns to them. Manually inspect the documents and the top words in the assigned topics. Does it look like the topics really describe what the documents are actually talking about?
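If you use gensim, both checks can be done in a few lines; this is only a rough sketch, and lda, dictionary, and held_out_bow are assumed to come from your own training step:

# 1. Inspect the topics: top words per topic
for topic_id, words in lda.show_topics(num_topics=-1, num_words=10, formatted=False):
    print(topic_id, [w for w, _ in words])

# 2. Inspect the topic assignments of a few held-out documents
for bow in held_out_bow[:5]:
    print(lda.get_document_topics(bow, minimum_probability=0.1))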
I realize that this process isn't as nice and quantitative as one might like, but to be honest, the applications of topic models are rarely quantitative either. I suggest evaluating your topic model according to the problem you're applying it to.
Good luck!

The general rule that using the training data for evaluation might be subject to overfitting also applies to unsupervised learning like LDA -- though it is not as obvious. LDA optimizes some objective, i.e. generative probability, on the training data. It might be that in the training data two words are indicative of a topic, say "white house" for US politics. Assume the two words co-occur only once (in the training data). Then any algorithm fully relying on the assumption that they indicate only politics and nothing else would be doing great if you evaluated on the training data. However, if there are other topics like "architecture", then you might question whether this is really the right thing to learn. Having a test data set can solve this issue to some extent:
Since the relationship "white house" seems rare in the training data, it likely does not occur at all in the test data. If so, the evaluation shows how much your model relies on spurious relationships that might in fact not be helpful compared to more general ones.
"White house" occurs in the test data, say it occurs once for "US politics" and once in a document on architecture. Then the assumption that it only indicates "US politics" is too strong and performance metrics will be worse, showing that your model is overfitting.

Related

How to reveal relations between number of words and target with self-attention based models?

Transformers can handle variable-length input, but what if the number of words might correlate with the target? Let's say we want to perform sentiment analysis on some reviews, where longer reviews are more likely to be bad. How can the model harness this knowledge? Of course, a simple solution could be to add this count as a feature after the self-attention layer. However, this hand-crafted approach wouldn't reveal more complex relations, for example: a high count of word X correlates with target 1, except when there is also a high count of word Y, in which case the target tends to be 0.
How could this information be included using deep learning? Paper recommendations on the topic are also appreciated.
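For what it's worth, a minimal sketch (in PyTorch; the encoder, the pooling choice, and all names are assumptions, not part of the question) of the "add the count as a feature after the self-attention layer" idea:

import torch
import torch.nn as nn

class LengthAwareClassifier(nn.Module):
    def __init__(self, encoder, hidden_dim, num_classes):
        super().__init__()
        self.encoder = encoder                                  # any self-attention encoder
        self.head = nn.Linear(hidden_dim + 1, num_classes)

    def forward(self, token_ids, attention_mask):
        hidden = self.encoder(token_ids, attention_mask)        # (batch, seq, hidden); assumed signature
        pooled = hidden.mean(dim=1)                             # mean-pool over tokens
        length = attention_mask.sum(dim=1, keepdim=True).float().log1p()  # log word count
        return self.head(torch.cat([pooled, length], dim=-1))   # length joins the pooled features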

Understanding ConvNet Prediction on Text Classification

I'm trying to debug a model that uses 1D convolutions to classify text that was labeled by humans as being "appropriate" vs "not appropriate" to be posted on some website. Looking at false positives (wrongly predicted "appropriate"), I see that the text has mostly neutral/positive sounding words, but the idea conveyed is bad (ex: talking about "capping population"). To address a case like this, I can think of ways to help the model learn that the subject of capping population (in this example) should not be classified as "appropriate" for this particular task.
The problem I'm having is understanding what caused the model to predict "not appropriate" for messages that are in fact appropriate. For example, the following message should be considered "appropriate":
"The blame lies with the individual who commits the crime."
The model thinks that's not appropriate, but according to the labeling criteria of the dataset, that's a valid message.
Question
Given a model with an embedding layer for each word, followed by several 1D convs + a dense layer, what are some techniques that can help me understand what is causing the model to classify that message as such, and potential ways to help the model learn that such messages are OK?
Update
It turns out that if I take the example phrase above, replace one word at a time, and see how the model classifies the resulting phrase, it classifies the phrase as "appropriate" when I replace the word "lies" with just about any other "positive" or "neutral" word. So it seems like the model learned that "lies" is a really, really bad word. The question is: how do I create a feature (or features), or otherwise help the model generalize beyond that?
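For reference, a small sketch of the word-substitution probe described above (predict_proba is an assumed function returning the model's probability of "not appropriate"; the replacement words are arbitrary):

def substitution_probe(sentence, predict_proba, replacements=("good", "fact", "thing")):
    words = sentence.split()
    base = predict_proba(sentence)                      # score of the original phrase
    shifts = []
    for i, original in enumerate(words):
        for replacement in replacements:
            probe = " ".join(words[:i] + [replacement] + words[i + 1:])
            shifts.append((original, replacement, predict_proba(probe) - base))
    # the largest shifts point to words the model is overly sensitive to (e.g. "lies")
    return sorted(shifts, key=lambda t: abs(t[2]), reverse=True)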
Maybe in the dataset used to train the model, most of the texts containing the word "lies" (and "related" expressions) were labeled as "not appropriate" by humans, and there weren't enough examples of "appropriate" usages (e.g. "lies are bad", "avoid spreading misinformation").
It could also be the case that many of the examples were related to the meaning "false statement" and not as many were related to other meanings.
These are some reasons I can think of for it to learn that texts containing "lies" are more likely to be "not appropriate".

What is the best way to represent a collection of documents in a fixed length vector?

I am trying to build a deep neural network that takes in a set of documents and predicts the category it belongs to.
Since the number of documents in each collection is not fixed, my first attempt was to get a mapping of the documents from doc2vec and use the average.
The training accuracy is as high as 90%, but the testing accuracy is as low as 60%.
Is there a better way of representing a collection of documents as a fixed-length vector so that the words they have in common are captured?
The description of your process so far is a bit vague and unclear – you may want to add more detail to your question.
Typically, Doc2Vec would convert each doc to a vector, not "a collection of documents".
If you did try to collapse a collection into a single vector – for example, by averaging many doc-vecs, or calculating a vector for a synthetic document with all the sub-documents' words – you might be losing valuable higher-dimensional structure.
To "predict the category" would be a typical "classification" problem, and with a bunch of documents (represented by their per-doc vectors) and known-labels, you could try various kinds of classifiers.
I suspect from your description, that you may just be collapsing a category to a single vector, then classifying new documents by checking which existing category-vector they're closest-to. That can work – it's vaguely a K-Nearest-Neighbors approach, but with every category reduced to one summary vector rather than the full set of known examples, and each classification being made by looking at just one single nearest-neighbor. That forces a simplicity on the process that may not match the "shapes" of the real categories as well as a true KNN classifier, or other classifiers, could achieve.
If accuracy on test data falls far below that observed during training, that can indicate that significant "overfitting" is occurring: the model(s) are essentially memorizing idiosyncrasies of the training data to "cheat" at answers based on arbitrary correlations, rather than learning generalizable rules. Making your model(s) smaller – such as by decreasing the dimensionality of your doc-vectors – may help in such situations, by giving the model less extra state in which to remember peculiarities of the training data. More data can also help, as the "noise" in more numerous, varied examples tends to cancel itself out, rather than acquire the sort of misguided importance that can be learned in smaller datasets.
There are other ways to convert a variable-length text into a fixed-length vector, including many based on deeper learning algorithms. But those can be even more training-data-hungry, and it seems like you may have other factors to improve before trying them in lieu of Doc2Vec.
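For concreteness, here is a rough sketch of the per-document-vector-then-average setup described in the question, with an ordinary classifier on top (gensim and scikit-learn assumed; train_collections and train_labels are placeholder names):

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Train Doc2Vec on all individual documents
all_docs = [TaggedDocument(doc.lower().split(), [i])
            for i, doc in enumerate(d for coll in train_collections for d in coll)]
d2v = Doc2Vec(all_docs, vector_size=100, min_count=2, epochs=40)

def collection_vector(collection):
    # collapse a collection into one fixed-length vector by averaging per-doc vectors
    return np.mean([d2v.infer_vector(d.lower().split()) for d in collection], axis=0)

X = np.array([collection_vector(c) for c in train_collections])
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)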

Generative model and inference

I was looking at the hLDA model here:
https://papers.nips.cc/paper/2466-hierarchical-topic-models-and-the-nested-chinese-restaurant-process.pdf
I have questions on how the generative model works. What will be the output of the generative model, and how is it used in the inference (Gibbs sampling) stage? I am getting the generative model and the inference part mixed up and am not able to distinguish between them.
I am new to this area, and any reference articles or papers that can help clarify the concepts would be very welcome.
To get a handle on how this type of Bayesian model works, I'd recommend David Blei's original 2003 LDA paper (google scholar "Latent Dirichlet Allocation" and it'll appear near the top). They used variational inference (as opposed to Gibbs sampling) to estimate the "posterior" (which you could call the "best fit solution"), but the principles behind using a generative model are well explained.
In a nutshell, Bayesian topic models work like this: you presume that your data is created by some "generative model". This model describes a probabilistic process for generating data, and has a few unspecified "latent" variables. In a topic model, these variables are the "topics" that you're trying to find. The idea is to find the most probable values for the "topics" given the data at hand.
In Bayesian inference these most probable values for latent variables are known as the "posterior". Strictly speaking, the posterior is actually a probability distribution over possible values for the latent variables, but a common approach is to use the most probable set of values, called "maximum a posteriori" or MAP estimation.
Note that for topic models, what you get is an estimate for the true MAP values. Many of the latent values, perhaps especially those close to zero, are essentially noise, and cannot be taken seriously (except for being close to zero). It's the larger values that are more meaningful.

How do I evaluate a text summarization tool?

I have written a system that summarizes a long document containing thousands of words. Are there any norms on how such a system should be evaluated in the context of a user survey?
In short, is there a metric for evaluating the time that my tool has saved a human? Currently, I am thinking of using (time taken to read the original document / time taken to read the summary) as a way of determining the time saved, but are there better metrics?
Currently, I am asking the user subjective questions about the accuracy of the summary.
In general:
Bleu measures precision: how much the words (and/or n-grams) in the machine generated summaries appeared in the human reference summaries.
Rouge measures recall: how much the words (and/or n-grams) in the human reference summaries appeared in the machine generated summaries.
Naturally, these metrics are complementary, as is often the case with precision vs. recall. If you have many words/n-grams from the system results appearing in the human references you will have high Bleu, and if you have many words/n-grams from the human references appearing in the system results you will have high Rouge.
There's something called the brevity penalty, which is quite important and has already been added to standard Bleu implementations. It penalizes system results which are shorter than the general length of a reference (read more about it here). This complements the n-gram metric's behavior, which in effect penalizes results that are longer than the reference, since the denominator grows the longer the system result is.
You could also implement something similar for Rouge, but this time penalizing system results which are longer than the general reference length, which would otherwise enable them to obtain artificially higher Rouge scores (since the longer the result, the higher the chance you would hit some word appearing in the references). In Rouge we divide by the length of the human references, so we would need an additional penalty for longer system results which could artificially raise their Rouge score.
Finally, you could use the F1 measure to make the metrics work together: F1 = 2 * (Bleu * Rouge) / (Bleu + Rouge)
BLEU
Bleu measures precision
Bilingual Evaluation Understudy
Originally for machine translation (bilingual)
Counts how many of the words (and/or n-grams) in the machine-generated summary appear in the human reference summaries
The closer a machine translation is to a professional human translation, the better it is
ROUGE
Rouge measures recall
Recall-Oriented Understudy for Gisting Evaluation
Counts how many of the words (and/or n-grams) in the human reference summaries appear in the machine-generated summary
Overlap of n-grams between the system and reference summaries
Rouge-N, where N is the n-gram order
reference_text = """Artificial intelligence (AI, also machine intelligence, MI) is intelligence demonstrated by machines, in contrast to the natural intelligence (NI) displayed by humans and other animals. In computer science AI research is defined as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving". See glossary of artificial intelligence. The scope of AI is disputed: as machines become increasingly capable, tasks considered as requiring "intelligence" are often removed from the definition, a phenomenon known as the AI effect, leading to the quip "AI is whatever hasn't been done yet." For instance, optical character recognition is frequently excluded from "artificial intelligence", having become a routine technology. Capabilities generally classified as AI as of 2017 include successfully understanding human speech, competing at a high level in strategic game systems (such as chess and Go), autonomous cars, intelligent routing in content delivery networks, military simulations, and interpreting complex data, including images and videos. Artificial intelligence was founded as an academic discipline in 1956, and in the years since has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success and renewed funding. For most of its history, AI research has been divided into subfields that often fail to communicate with each other. These sub-fields are based on technical considerations, such as particular goals (e.g. "robotics" or "machine learning"), the use of particular tools ("logic" or "neural networks"), or deep philosophical differences. Subfields have also been based on social factors (particular institutions or the work of particular researchers). The traditional problems (or goals) of AI research include reasoning, knowledge, planning, learning, natural language processing, perception and the ability to move and manipulate objects. General intelligence is among the field's long-term goals. Approaches include statistical methods, computational intelligence, and traditional symbolic AI. Many tools are used in AI, including versions of search and mathematical optimization, neural networks and methods based on statistics, probability and economics. The AI field draws upon computer science, mathematics, psychology, linguistics, philosophy and many others. The field was founded on the claim that human intelligence "can be so precisely described that a machine can be made to simulate it". This raises philosophical arguments about the nature of the mind and the ethics of creating artificial beings endowed with human-like intelligence, issues which have been explored by myth, fiction and philosophy since antiquity. Some people also consider AI to be a danger to humanity if it progresses unabatedly. Others believe that AI, unlike previous technological revolutions, will create a risk of mass unemployment. 
In the twenty-first century, AI techniques have experienced a resurgence following concurrent advances in computer power, large amounts of data, and theoretical understanding; and AI techniques have become an essential part of the technology industry, helping to solve many challenging problems in computer science."""
Abstractive summarization
# Abstractive summarization
len(reference_text.split())  # word count of the source text
from transformers import pipeline
summarization = pipeline("summarization")
abstractive_summarization = summarization(reference_text)[0]["summary_text"]
Abstractive Output
In computer science AI research is defined as the study of "intelligent agents" Colloquially, the term "artificial intelligence" is applied when a machine mimics "cognitive" functions that humans associate with other human minds, such as "learning" and "problem solving" Capabilities generally classified as AI as of 2017 include successfully understanding human speech, competing at a high level in strategic game systems (such as chess and Go)
Extractive summarization
# Extractive summarization
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
parser = PlaintextParser.from_string(reference_text, Tokenizer("english"))
# parser.document.sentences
summarizer = LexRankSummarizer()
extractive_summarization = summarizer(parser.document, 2)  # keep the top 2 sentences
extractive_summarization = ' '.join([str(s) for s in extractive_summarization])
Extractive Output
Colloquially, the term "artificial intelligence" is often used to describe machines that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving".As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect. Sub-fields have also been based on social factors (particular institutions or the work of particular researchers).The traditional problems (or goals) of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception and the ability to move and manipulate objects.
Using Rouge to evaluate the abstractive summary
from rouge import Rouge
r = Rouge()
r.get_scores(abstractive_summarization, reference_text)
Rouge output for the abstractive summary
[{'rouge-1': {'f': 0.22299651364421083,
'p': 0.9696969696969697,
'r': 0.12598425196850394},
'rouge-2': {'f': 0.21328671127225052,
'p': 0.9384615384615385,
'r': 0.1203155818540434},
'rouge-l': {'f': 0.29041095634452996,
'p': 0.9636363636363636,
'r': 0.17096774193548386}}]
Using Rouge to evaluate the extractive summary
from rouge import Rouge
r = Rouge()
r.get_scores(extractive_summarization, reference_text)
Rouge output for the extractive summary
[{'rouge-1': {'f': 0.27860696251962963,
'p': 0.8842105263157894,
'r': 0.16535433070866143},
'rouge-2': {'f': 0.22296172781038814,
'p': 0.7127659574468085,
'r': 0.13214990138067062},
'rouge-l': {'f': 0.354755780824869,
'p': 0.8734177215189873,
'r': 0.22258064516129034}}]
Interpreting Rouge scores
ROUGE is a score of overlapping words. ROUGE-N refers to overlapping n-grams. Specifically:
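The ROUGE-N definition being described here, with r ranging over the reference summaries and s over the n-grams in each reference, is:

$$\text{ROUGE-N} = \frac{\sum_{r \in \text{References}} \; \sum_{s \in \text{n-grams}_N(r)} \text{Count}_{\text{match}}(s)}{\sum_{r \in \text{References}} \; \sum_{s \in \text{n-grams}_N(r)} \text{Count}(s)}$$

where Count_match(s) is the number of times the n-gram s from a reference also appears in the candidate summary.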
I tried to simplify the notation compared with the original paper. Let's assume we are calculating ROUGE-2, aka bigram matches. The numerator's ∑s loops through all bigrams in a single reference summary and calculates the number of times a matching bigram is found in the candidate summary (proposed by the summarization algorithm). If there is more than one reference summary, ∑r ensures we repeat the process over all reference summaries.
The denominator simply counts the total number of bigrams in all reference summaries. This is the process for one document-summary pair. You repeat the process for all documents, and average all the scores and that gives you a ROUGE-N score. So a higher score would mean that on average there is a high overlap of n-grams between your summaries and the references.
Example:
S1. police killed the gunman
S2. police kill the gunman
S3. the gunman kill police
S1 is the reference and S2 and S3 are candidates. Note S2 and S3 both have one overlapping bigram with the reference, so they have the same ROUGE-2 score, although S2 should be better. An additional ROUGE-L score deals with this, where L stands for Longest Common Subsequence. In S2, the first word and last two words match the reference, so it scores 3/4, whereas S3 only matches the bigram, so scores 2/4.
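A quick plain-Python sketch (not from the original answer) that reproduces these toy numbers:

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def lcs_len(a, b):
    # classic dynamic-programming longest common subsequence
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

ref = "police killed the gunman".split()
for cand in ("police kill the gunman", "the gunman kill police"):
    c = cand.split()
    rouge2 = len(ngrams(ref, 2) & ngrams(c, 2)) / len(ngrams(ref, 2))   # 1/3 for both candidates
    rougeL = lcs_len(ref, c) / len(ref)                                  # 3/4 vs. 2/4
    print(cand, rouge2, rougeL)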
Historically, summarization systems have often been evaluated by comparing to human-generated reference summaries. In some cases, the human summarizer constructs a summary by selecting relevant sentences from the original document; in others, the summaries are hand-written from scratch.
Those two techniques are analogous to the two major categories of automatic summarization systems - extractive vs. abstractive (more details available on Wikipedia).
One standard tool is Rouge, a script (or a set of scripts; I can't remember offhand) that computes n-gram overlap between the automatic summary and a reference summary. Rouge can optionally compute overlap allowing word insertions or deletions between the two summaries (e.g. if allowing a 2-word skip, 'installed pumps' would be credited as a match to 'installed defective flood-control pumps').
My understanding is that Rouge's n-gram overlap scores were fairly well correlated with human evaluation of summaries up to some level of accuracy, but that the relationship may break down as summarization quality improves. I.e., that beyond some quality threshold, summaries that are judged better by human evaluators may be scored similarly to - or outscored by - summaries judged inferior. Nevertheless, Rouge scores might be a helpful first cut at comparing 2 candidate summarization systems, or a way to automate regression testing and weed out serious regressions before passing a system on to human evaluators.
Your approach of collecting human judgements is probably the best evaluation, if you're able to afford the time / monetary cost. To add a little rigor to that process, you might look at the scoring criteria used in recent summarization tasks (see the various conferences mentioned by @John Lehmann). The scoresheets used by those evaluators might help guide your own evaluation.
I'm not sure about the time evaluation, but regarding accuracy you might consult literature under the topic Automatic Document Summarization. The primary evaluation was the Document Understanding Conference (DUC) until the Summarization task was moved into Text Analysis Conference (TAC) in 2008. Most of these focus on advanced summarization topics such as multi-document, multi-lingual, and update summaries.
You can find the evaluation guidelines for each of these events posted online. For single document summarization tasks look at DUC 2002-2004.
Or, you might consult the ADS evaluation section in Wikipedia.
There is also the very recent BERTScore metric (arXiv'19, ICLR'20, already almost 90 citations) that does not suffer from the well-known issues of ROUGE and BLEU.
Abstract from the paper:
We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTScore is more robust to challenging examples when compared to existing metrics.
Paper: https://arxiv.org/pdf/1904.09675.pdf
Code: https://github.com/Tiiiger/bert_score
Full reference:
Zhang, Tianyi, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. "Bertscore: Evaluating text generation with bert." arXiv preprint arXiv:1904.09675 (2019).
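A minimal usage sketch for the linked bert_score package (pip install bert-score); the candidate/reference strings here are placeholders:

from bert_score import score

candidates = ["the gunman was killed by the police"]
references = ["police killed the gunman"]
P, R, F1 = score(candidates, references, lang="en")
print(F1.mean().item())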
There are many parameters against which you can evaluate your summarization system, for example:
Precision = Number of important sentences in the summary / Total number of sentences in the summary.
Recall = Number of important sentences retrieved / Total number of important sentences present.
F Score = 2 * (Precision * Recall) / (Precision + Recall)
Compression Rate = Total number of words in the summary / Total number of words in the original document.
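A tiny sketch of those formulas as code; the counts are assumed to come from human annotation of which sentences are "important":

def summary_metrics(important_in_summary, summary_sentences,
                    important_total, summary_words, original_words):
    precision = important_in_summary / summary_sentences
    recall = important_in_summary / important_total
    f_score = 2 * precision * recall / (precision + recall)
    compression_rate = summary_words / original_words
    return precision, recall, f_score, compression_rate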
When you are evaluating an automatic summarisation system you would typically look at the content of the summary rather than time.
Your idea of:
(Time taken to read the original document/Time taken to read the summary)
doesn't tell you much about your summarisation system; it really only gives you an idea of the compression rate of your system (i.e. the summary is 10% of the original document).
You may want to consider the time it takes your system to summarise a document vs. the time it would take a human (system: 2s, human: 10 mins).
I recommend BARTScore. Check the GitHub page and the article. The authors also released a meta-evaluation on the ExplainaBoard platform, "which allows to interactively understand the strengths, weaknesses, and complementarity of each metric". You can find the list of most of the state-of-the-art metrics there.
As a quick summary of the available metrics, I wrote a post describing the evaluation metrics: what kinds of metrics there are, how they differ from human evaluation, etc. You can read the blog post Evaluation Metrics: Assessing the quality of NLG outputs.
Also, alongside our NLP projects we created and publicly released an evaluation package, Jury, which is still actively maintained; you can see the reasons why we created it in the repo. There are several packages for carrying out evaluation in NLP (some of them specialized for a specific NLP task):
jury
datasets
sacrebleu (Machine Translation)
SummEval (Summarization)