Keyword extraction and keyword-based text classification - deep-learning

Currently I am working on a project that requires keyword extraction, or, put differently, keyword-based text classification. The dataset contains three columns: text, keywords, and cc terms. I need to extract keywords from the text and then classify the text based on those keywords. Each row in the dataset has its own keywords, and I want to extract similar kinds of keywords. I want to train a model on the text and keyword columns so that it can extract keywords for unseen text. Please help.

Keyword extraction is typically done with TF-IDF scores, simply by setting a score threshold. When training a classifier, however, it does not make much sense to cut the keywords off at a threshold: knowing that something is unlikely to be a keyword can also be a valuable piece of information for the classifier.
The simplest way to get TF-IDF scores for particular words is to use TfidfVectorizer from scikit-learn, which takes care of the laborious text preprocessing steps (tokenization, stop-word removal).
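For reference, a minimal sketch of that step might look like the following; the toy DataFrame, the column names, and the top-3 cut-off are illustrative assumptions rather than your actual data:

```python
# Score candidate keywords per document with TF-IDF (scikit-learn).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder data standing in for the real "text" / "keywords" columns.
df = pd.DataFrame({
    "text": ["the cat sat on the mat", "dogs chase cats in the park"],
    "keywords": ["cat mat", "dogs park"],
})

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(df["text"])      # shape: (n_docs, n_terms)
terms = vectorizer.get_feature_names_out()

# Take the top-3 highest-scoring terms of each document as candidate keywords.
for i, row in enumerate(tfidf.toarray()):
    top = row.argsort()[::-1][:3]
    print(df["text"][i], "->", [terms[j] for j in top if row[j] > 0])
```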
You can probably achieve better results by fine-tuning BERT for your classification task (though at the expense of much higher computational cost).
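If you go that route, a rough sketch with the Hugging Face transformers library could look like this; the toy texts, the integer-encoded labels, and the training arguments are all placeholder assumptions:

```python
# Fine-tune a BERT classifier on (text, label) pairs with the Hugging Face Trainer.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["example document one", "example document two"]   # placeholder texts
labels = [0, 1]                                             # placeholder class ids

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

class TextDataset(torch.utils.data.Dataset):
    """Wraps the tokenized batch so Trainer can iterate over it."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=TextDataset(enc, labels),
)
trainer.train()
```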

Related

How to reveal relations between number of words and target with self-attention based models?

Transformers can handle variable-length input, but what if the number of words correlates with the target? Say we want to perform sentiment analysis on reviews, where longer reviews are more likely to be negative. How can the model exploit this knowledge? A simple solution would of course be to add the word count as a feature after the self-attention layer. However, such a hand-crafted approach would not capture more complex relations, for example: a high count of word X correlates with target 1, unless there is also a high count of word Y, in which case the target tends to be 0.
How could this information be included using deep learning? Paper recommendations on the topic are also appreciated.
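For concreteness, the simple count-as-a-feature baseline mentioned above might look like the following PyTorch sketch; the model sizes, mean pooling, and length normalisation are illustrative assumptions:

```python
# Concatenate a (normalised) token count to the pooled self-attention output
# before the classification head.
import torch
import torch.nn as nn

class LengthAwareClassifier(nn.Module):
    def __init__(self, vocab_size=10000, d_model=128, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model + 1, n_classes)   # +1 for the length feature

    def forward(self, token_ids):                       # (batch, seq_len), 0 = padding
        pad = token_ids.eq(0)
        h = self.encoder(self.embed(token_ids), src_key_padding_mask=pad)
        # Mean-pool over non-padding positions.
        pooled = h.masked_fill(pad.unsqueeze(-1), 0).sum(1) / (~pad).sum(1, keepdim=True)
        # Word count, normalised by the maximum sequence length in the batch.
        length = (~pad).sum(1, keepdim=True).float() / token_ids.size(1)
        return self.head(torch.cat([pooled, length], dim=-1))

logits = LengthAwareClassifier()(torch.randint(1, 10000, (3, 20)))
print(logits.shape)   # torch.Size([3, 2])
```

As noted above, this only injects the raw count; with a single linear head it cannot capture interactions like the X/Y example, which is exactly the limitation the question is asking about.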

Understanding ConvNet Prediction on Text Classification

I'm trying to debug a model that uses 1D convolutions to classify text that was labeled by humans as being "appropriate" vs "not appropriate" to be posted on some website. Looking at false positives (wrongly predicted "appropriate"), I see that the text has mostly neutral/positive sounding words, but the idea conveyed is bad (ex: talking about "capping population"). To address a case like this, I can think of ways to help the model learn that the subject of capping population (in this example) should not be classified as "appropriate" for this particular task.
The problem I'm having is understanding what caused the model to predict "not appropriate" for messages that are in fact appropriate. For example, the following message should be considered "appropriate":
"The blame lies with the individual who commits the crime."
The model thinks that's not appropriate, but according to the labeling criteria of the dataset, that's a valid message.
Question
Given a model with an embedding layer for each word, followed by several 1D convolutions and a dense layer, what are some techniques that can help me understand what is causing the model to classify that message as such, and potential ways to help the model learn that it is OK?
Update
It turns out that if I take the example phrase above and replace one word at a time, then see how the model classifies the resulting phrase, it classifies the phrase as "appropriate" when I replace the word "lies" with just about any other "positive" or "neutral" word. So it seems like the model learned that "lies" is a really, really bad word. The question is: how do I create a feature (or otherwise help the model) to generalize beyond that?
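For concreteness, that word-substitution probe might look like the following sketch; model.predict_proba (returning [P("not appropriate"), P("appropriate")]) and the whitespace tokenization are assumptions rather than the actual pipeline:

```python
# Replace one word at a time with a neutral placeholder and measure how much
# the predicted probability of "appropriate" moves.
def probe_word_influence(model, text, placeholder="thing"):
    words = text.split()
    base = model.predict_proba([text])[0][1]    # P("appropriate") for the original
    influence = {}
    for i, word in enumerate(words):
        perturbed = " ".join(words[:i] + [placeholder] + words[i + 1:])
        influence[word] = model.predict_proba([perturbed])[0][1] - base
    return influence

# Words with a large positive influence (e.g. "lies" in the example) are the
# ones dragging the original prediction toward "not appropriate".
```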
Maybe in the dataset used to train the model, most of the texts containing the word "lies" (and related expressions) were labeled as "not appropriate" by humans, and there weren't enough examples of "appropriate" usages (e.g. "lies are bad", "avoid spreading misinformation").
It could also be the case that many of the examples were related to the meaning "false statement" and not as many were related to other meanings.
These are some reasons I can think of for it to learn that texts containing "lies" are more likely to be "not appropriate".

How can I consider word dependence along with the semantic information in information retrieval?

I am working on a project in which text retrieval is an important part. There is a reference collection (D), and users can enter queries (Q). So, like a search engine, the goal is to retrieve the documents most related to each query.
I used pre-trained word embeddings to extract semantic knowledge about each word within a text. I then aggregated the continuous word vectors to represent each text as a single vector (using a mean/sum aggregation function). Next, I indexed the source vectors and retrieved the vectors most similar to the query vector. However, the results were not acceptable. I also tested traditional approaches such as the bag-of-words (BOW) technique. While these approaches work very well in some situations, they do not consider semantic and syntactic information, which makes them a poor fit for some queries.
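For concreteness, a sketch of that mean-aggregation baseline, assuming gensim and a publicly available GloVe model (the documents, query, and model name are illustrative):

```python
# Represent each text as the mean of its pre-trained word vectors and rank
# documents by cosine similarity to the query vector.
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # any pre-trained KeyedVectors will do

def embed(text):
    vecs = [wv[w] for w in text.lower().split() if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

docs = ["the cat sat on the mat", "stock markets fell sharply today"]
doc_vecs = np.stack([embed(d) for d in docs])

query = embed("feline on a rug")
scores = doc_vecs @ query / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query) + 1e-9)
print(docs[int(scores.argmax())])          # document most similar to the query
```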
Based on my investigation, considering word dependence (for example, words co-occurring in the same sentence) along with the semantic information (obtained from the pre-trained word embeddings) can be very useful. However, I do not know how to combine the two in a way that is applicable to IR.
It should be noted that:
I'm not looking for paragraph2vec or doc2vec; those require training on a large corpus, and I don't have one. Instead, I want to use existing word embeddings.
I'm not looking for a re-ranking technique like the learning-to-rank approach. Instead, I'm looking for a way to take advantage of both syntactic and semantic information in the representation step, i.e., mapping the text or query to a feature vector.
Any help would be appreciated.

How CFG and Google n-grams can be combined to generate sentences

I have a valid list of grammar rules and lexical items for generating grammatically correct, yet meaningless, phrases. I want to combine this with Google n-grams to generate only valid sentences. Is this feasible, and is there any paper on this? I am using NLTK and the Stanford CoreNLP tools.
No, it is not feasible. Real sentences have structural and meaning dependencies that go well beyond what can be captured in n-grams.
I suppose you're thinking of generating a random structure by expanding your CFG, then using n-grams to select among the possible vocabulary choices. That's a pretty simple thing to code: chop your grammar off at the part-of-speech level, generate a "sentence" with your CFG as a string of POS tags, and use the n-grams to fill in the words one by one.
To work with Google's entire 5-gram collection you'll need a lot of disk space and a huge amount of RAM (or some clever programming), so I recommend you experiment with one of NLTK's tagged corpora (e.g., the Brown corpus with the "universal" tagset). Starting from any text, it is not hard to collect its n-grams, write a random text generator, and confirm that it produces semi-cohesive but undeniably incoherent (and still mostly ungrammatical) nonsense.
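A rough sketch of that recipe, assuming a toy POS-level CFG, the Brown corpus with the universal tagset, and a greedy bigram fill-in (all simplifying assumptions):

```python
# Generate POS-tag "sentences" from a tiny CFG, then fill each tag with a word
# chosen by Brown-corpus bigram counts.
import random
from collections import defaultdict

import nltk
from nltk import CFG
from nltk.parse.generate import generate

nltk.download("brown", quiet=True)
nltk.download("universal_tagset", quiet=True)

# Toy CFG whose terminals are universal POS tags, not words.
grammar = CFG.fromstring("""
S -> NP VP
NP -> 'DET' 'NOUN' | 'DET' 'ADJ' 'NOUN'
VP -> 'VERB' NP
""")

# Word candidates per tag and word-bigram counts, both from the Brown corpus.
words_by_tag = defaultdict(list)
for w, t in nltk.corpus.brown.tagged_words(tagset="universal"):
    words_by_tag[t].append(w.lower())

bigrams = defaultdict(int)
for sent in nltk.corpus.brown.sents():
    for a, b in zip(sent, sent[1:]):
        bigrams[(a.lower(), b.lower())] += 1

def fill(tag_sequence):
    """Greedily pick, for each tag, the candidate word with the highest bigram
    count given the previously chosen word."""
    out = [random.choice(words_by_tag[tag_sequence[0]])]
    for tag in tag_sequence[1:]:
        pool = random.sample(words_by_tag[tag], min(200, len(words_by_tag[tag])))
        out.append(max(pool, key=lambda w: bigrams[(out[-1], w)]))
    return " ".join(out)

for tags in generate(grammar, n=3):
    print(fill(tags))
```

As the answer above warns, the output is locally plausible at best: the bigram counts only smooth adjacent word choices and do nothing for global coherence.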

Train Tesseract for specific words - possible?

I want to use Tesseract to extract about 10-20 keywords from a document. The document will contain only English characters/words. What I am interested in is something like "Age: 23"; here "Age" is the keyword I am interested in, and I want to extract the "23" (its value) as well.
The first approach that comes to mind is to extract the whole page as text and then look for the keywords in the recognized text. But in terms of training Tesseract, is there a better approach, given that I know the keywords in advance, that might result in better accuracy?
I am more or less aware of the limitations of Tesseract OCR and am trying to get the most out of it within those limitations. Thanks for all your expert advice.
Try the bazaar matching-pattern config in Tesseract, which loads user-supplied word lists and patterns (user-words / user-patterns files) that can bias recognition toward your expected keywords.
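Whichever route you take, the "OCR the whole page, then search" step from the question might look like this sketch with pytesseract; the file name, keyword list, and regex are illustrative assumptions:

```python
# OCR the full page and then pull out "Keyword: value" pairs with a regex.
import re

import pytesseract
from PIL import Image

KEYWORDS = ["Age", "Name", "Date"]          # the ~10-20 keywords of interest

text = pytesseract.image_to_string(Image.open("document.png"))

for kw in KEYWORDS:
    match = re.search(rf"{kw}\s*[:\-]\s*(\S+)", text, flags=re.IGNORECASE)
    if match:
        print(kw, "->", match.group(1))     # e.g. Age -> 23
```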