How to import and use feature vectors in MALLET's topic modelling?

I am using MALLET's topic modelling.
I have a set of keywords, along with weights, for a set of documents; I want to train a model on these and then use it to infer topics for new documents.
Note: each keyword of a document has a weight assigned to it, similar to a tf-idf score.
From what I can infer from the documentation, MALLET's topic modelling supports only sequence data, not vector data.
I want to use the weights assigned to each keyword of a document in the analysis. If I don't, every keyword is treated equally and, as a result, I lose important information during the analysis.
Any suggestions on how I can use MALLET topic modelling for my data? The only workaround I have thought of is sketched below.
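One way to approximate the weights (a workaround, not a built-in MALLET feature) would be to repeat each keyword in proportion to its weight before importing, so that token counts stand in for weights. A minimal sketch that writes a file in the one-instance-per-line format accepted by bin/mallet import-file; the data and scale factor are illustrative:

    # Hypothetical weighted-keyword data: document id -> {keyword: weight}.
    docs = {
        "doc1": {"network": 0.9, "graph": 0.4},
        "doc2": {"topic": 0.7, "model": 0.6},
    }

    SCALE = 10  # illustrative: weight 0.9 -> the token is repeated 9 times

    with open("corpus.txt", "w") as f:
        for doc_id, keywords in docs.items():
            tokens = []
            for word, weight in keywords.items():
                tokens.extend([word] * max(1, round(weight * SCALE)))
            # import-file expects one instance per line: <id> <label> <text...>
            f.write(f"{doc_id} X {' '.join(tokens)}\n")

The resulting file could then be imported with bin/mallet import-file --input corpus.txt --keep-sequence and trained with train-topics as usual, but I am not sure this is the right approach.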

Related

How to reveal relations between the number of words and the target with self-attention-based models?

Transformers can handle variable-length input, but what if the number of words correlates with the target? Say we want to perform sentiment analysis on reviews, where longer reviews are more likely to be negative. How can the model harness this knowledge? Of course, a simple solution could be to add the word count as a feature after the self-attention layer (a sketch of this baseline follows below). However, this hand-crafted approach wouldn't reveal more complex relations: for example, a high count of word X might correlate with target 1, unless there is also a high count of word Y, in which case the target tends to be 0.
How could this information be included using deep learning? Paper recommendations on the topic are also appreciated.
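Here is that baseline, as a minimal sketch assuming PyTorch; the dimensions and the length normalization are illustrative:

    import torch
    import torch.nn as nn

    class LengthAwareClassifier(nn.Module):
        # Pool the self-attention output, then concatenate a length feature
        # so the classification head can combine content and length.
        def __init__(self, d_model=64, n_heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.head = nn.Linear(d_model + 1, 1)

        def forward(self, x, lengths):
            # x: (batch, seq, d_model); lengths: (batch,)
            attended, _ = self.attn(x, x, x)
            pooled = attended.mean(dim=1)
            length_feat = lengths.float().unsqueeze(1) / 100.0  # crude normalization
            return self.head(torch.cat([pooled, length_feat], dim=1))

    model = LengthAwareClassifier()
    out = model(torch.randn(2, 5, 64), torch.tensor([5, 3]))
    print(out.shape)  # torch.Size([2, 1])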

Keyword extraction and keyword-based text classification

I am currently working on a project that requires keyword extraction, or in other words keyword-based text classification. The dataset contains three columns: text, keywords, and CC terms. I need to extract keywords from the text and then classify the text based on those keywords; each row in the dataset has its own keywords, and I want to extract similar kinds of keywords. I want to train a model by providing the text and keyword columns so that it is able to extract keywords for unseen text. Please help.
Keyword extraction is typically done using TF-IDF scores, simply by setting a score threshold. When training a classifier, however, it does not make much sense to cut off the keywords at a certain threshold: knowing that something is unlikely to be a keyword can also be a valuable piece of information for the classifier.
The simplest way to get the TF-IDF scores for particular words is the TfidfVectorizer in scikit-learn, which handles the laborious text-preprocessing steps (tokenization, stop-word removal).
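For instance, a minimal sketch (the corpus and the threshold are illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "Topic models discover latent structure in text.",
        "Keyword extraction selects the most informative terms.",
    ]

    # Fit TF-IDF over the corpus; English stop words are removed automatically.
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(docs)

    # Score the terms of the first document and keep those above a threshold.
    terms = vectorizer.get_feature_names_out()
    scores = tfidf[0].toarray().ravel()
    keywords = [t for t, s in zip(terms, scores) if s > 0.3]  # illustrative cut-off
    print(keywords)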
You can probably achieve better results by fine-tuning BERT for your classification task (but of course at the expense of much higher computational costs).

How can I consider word dependence along with the semantic information in information retrieval?

I am working on a project in which text retrieval is an important part. There is a reference collection (D), and users can enter queries (Q). Like a search engine, the goal is to retrieve the documents most related to each query.
I used pre-trained word embeddings to extract semantic knowledge about each word within a text. I then aggregated the continuous word vectors to represent each text as a single vector (using a mean/sum aggregation function). Next, I indexed the source vectors and retrieved the vectors most similar to the query vector. However, the results were not acceptable. I also tested traditional approaches like the bag-of-words (BOW) technique. While these approaches work very well in some situations, they do not consider semantic and syntactic information (which makes them perform poorly on some queries).
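To make the setup concrete, a minimal sketch of what I did, assuming the pre-trained embeddings are available as a word-to-vector dictionary (the toy data is illustrative):

    import numpy as np

    # Hypothetical pre-trained embeddings: word -> dense vector (toy 2-d vectors).
    embeddings = {
        "cat": np.array([0.2, 0.8]),
        "dog": np.array([0.3, 0.7]),
        "car": np.array([0.9, 0.1]),
    }

    def text_vector(text):
        # Mean-pool the vectors of the words we have embeddings for.
        vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
        return np.mean(vecs, axis=0) if vecs else np.zeros(2)

    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    docs = ["cat dog", "car"]
    doc_vecs = [text_vector(d) for d in docs]
    q = text_vector("dog")
    ranking = sorted(range(len(docs)), key=lambda i: -cosine(q, doc_vecs[i]))
    print([docs[i] for i in ranking])  # documents ordered by similarity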
Based on my investigation, considering word dependence (for example, words co-occurring in the same sentence) along with the semantic information (obtained from the pre-trained word embeddings) can be very useful. However, I do not know how to combine them in a way that is applicable to IR.
It should be noted that:
I'm not looking for paragraph2vec or doc2vec; those require training on a large corpus, and I don't have one. Instead, I want to use existing word embeddings.
I'm not looking for a re-ranking technique like the learning-to-rank approach. Instead, I'm looking for a way to take advantage of both syntactic and semantic information in the representation step, i.e. when mapping the text or query to a feature vector.
Any help would be appreciated.

Generative model and inference

I was looking at the hLDA model here:
https://papers.nips.cc/paper/2466-hierarchical-topic-models-and-the-nested-chinese-restaurant-process.pdf
I have questions about how the generative model works. What is the output of the generative model, and how is it used in the inference (Gibbs sampling) stage? I am getting the generative model and the inference part mixed up and am not able to distinguish between them.
I am new to this area, and any reference articles or papers that help clarify the concept would be very helpful.
To get a handle on how this type of Bayesian model works, I'd recommend David Blei's original 2003 LDA paper (search Google Scholar for "Latent Dirichlet Allocation" and it will appear near the top). The authors used variational inference (as opposed to Gibbs sampling) to estimate the "posterior" (which you could call the "best-fit solution"), but the principles behind using a generative model are well explained.
In a nutshell, Bayesian topic models work like this: you presume that your data is created by some "generative model". This model describes a probabilistic process for generating data, and has a few unspecified "latent" variables. In a topic model, these variables are the "topics" that you're trying to find. The idea is to find the most probable values for the "topics" given the data at hand.
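To make this concrete, here is a minimal sketch of LDA's generative process; the vocabulary, topic count, and Dirichlet parameters are all illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["gene", "brain", "data", "model"]
    n_topics, doc_len, alpha, beta = 2, 10, 0.5, 0.5

    # Latent variables: per-topic word distributions, per-document topic mixture.
    topic_word = rng.dirichlet([beta] * len(vocab), size=n_topics)
    doc_topics = rng.dirichlet([alpha] * n_topics)

    # Generate one document: pick a topic per word, then a word from that topic.
    words = []
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=doc_topics)
        words.append(vocab[rng.choice(len(vocab), p=topic_word[z])])
    print(words)

Inference (e.g. Gibbs sampling) runs this process in reverse: given only the observed words, it estimates the latent topic_word and doc_topics values that most probably produced them.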
In Bayesian inference these most probable values for latent variables are known as the "posterior". Strictly speaking, the posterior is actually a probability distribution over possible values for the latent variables, but a common approach is to use the most probable set of values, called "maximum a posteriori" or MAP estimation.
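In symbols, with latent variables $\theta$ and observed data $D$:

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \, p(\theta \mid D) = \arg\max_{\theta} \, \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \arg\max_{\theta} \, p(D \mid \theta)\, p(\theta),$$

since $p(D)$ does not depend on $\theta$.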
Note that for topic models, what you get is an estimate for the true MAP values. Many of the latent values, perhaps especially those close to zero, are essentially noise, and cannot be taken seriously (except for being close to zero). It's the larger values that are more meaningful.

Is there a difference between 'data structure' and 'data type'?

Two questions that often come up in exams at the university where I study are:
Define data types. Classify and explain datatypes
Define data structures. Classify and explain data structures
Aren't they somehow the same thing?
Consider that you are making a Tree<E> in Java. You'd declare your Tree<E> class, add methods to it, and somewhere you would write Tree<String> myTree = new Tree<>(); to make a tree object.
Say you were asked: of what type is the variable myTree? The answer would be Tree<E>. Your data 'structure' is now a data 'type'.
Now, since they are the same, they would be classified in the same way, on whatever basis you want to classify them: primitive or non-primitive, homogeneous or heterogeneous, linear or hierarchical.
That is my understanding. Is it wrong?
To start, I would like to make a correction: you created a class called "Tree" and an object called "myTree", not a variable called "myTree" of data type "Tree". These are different things.
The following is the definition of a data type:
A data type, or simply type, is a classification identifying one of various types of data, such as real-valued, integer, or Boolean, that determines the possible values for that type; the operations that can be done on values of that type; the meaning of the data; and the way values of that type can be stored.
Now, as per Wikipedia, there are various definitions for "type" in data type.
The question you have asked is a good one. There are data types in today's modern languages that are referred to as abstract data types, or ADTs for short. The definition of an ADT is:
An abstract data type (ADT) is a mathematical model for a certain class of data structures that have similar behavior; or for certain data types of one or more programming languages that have similar semantics. An abstract data type is defined indirectly, only by the operations that may be performed on it and by mathematical constraints on the effects (and possibly cost) of those operations.
It is also written that:
Abstract data types are purely theoretical entities, used (among other things) to simplify the description of abstract algorithms, to classify and evaluate data structures, and to formally describe the type systems of programming languages. However, an ADT may be implemented by specific data types or data structures, in many ways and in many programming languages; or described in a formal specification language.
Meaning that ADTs can be implemented using either data types or data structures.
As for data structures:
A data structure is a particular way of storing and organizing data in a computer so that it can be used efficiently.
Many textbooks use these words interchangeably. With more complex types, that can lead to confusion.
Take a small example: it is something of a standard to use a B-tree for implementing databases. That is, we know this type of ADT is well suited to this kind of problem and can handle it effectively. But to get this effectiveness out of an ADT, you need to create a data structure that gives you the desired behaviour.
Another example: there are many trees, such as the B-tree, the binary search tree, the AA tree, etc. All of these are essentially of the tree type, but each one is its own data structure.
Refer: List of data structures for a huge list of available structures.
The distinction is between abstract and concrete data structures. Some CS textbooks refer to abstract data structures as "data types", which is confusing because not all data types are data structures. They use "data structure" to specifically mean a concrete data structure.
An abstract data structure, also called an abstract data type, is the interface of the data structure. Java often represents them using interfaces; examples are List, Queue, Map, Deque, Set. (But there are others not represented in Java, such as bags/multisets, multimaps, graphs, stacks, and priority queues.) They are distinguished by their behavior and how you use the data structure. For instance, a set is characterized by forbidding duplicates and not recording order, whereas a list allows duplicates and remembers the order. A queue has a restricted interface that only lets you add to one end and remove from the other.
A concrete data structure is an implementation of an abstract data structure. Examples are ArrayList and LinkedList. These are both implementations of lists; while their list interface is the same, the programmer might still care about their different performance characteristics. Note that LinkedList also implements Queue.
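The same distinction shows up in Python's standard library; a minimal illustrative sketch:

    from collections.abc import MutableMapping, MutableSequence
    from collections import deque

    # list and deque are concrete data structures; MutableSequence is the
    # abstract interface describing their shared behaviour.
    print(isinstance([], MutableSequence))       # True
    print(isinstance(deque(), MutableSequence))  # True

    # dict is a concrete implementation of the abstract mapping ("Map") interface.
    print(isinstance({}, MutableMapping))        # True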
Also, there exist data structures in programming languages that have no type system. For example, you can model a Map in LISP, or have a dictionary in Python. It would be misleading to speak of a type here, as "type", IMHO, only makes sense with respect to some type system, or as an abstract concept like "the set of all values that inhabit t".
So it seems that "data structure" has a connotation of a concrete implementation of some abstract type. If, on the other hand, we speak of some object in a programming language with a type system, we would probably say that "it has type XY".