How can a CFG and Google n-grams be combined to generate sentences? - nltk

I have a valid list of grammar rules and lexical items for generating grammatically correct yet meaningless phrases. I want to combine them with Google n-grams so that only valid sentences are generated. Is this feasible, and is there any paper on it? I am using NLTK and the Stanford CoreNLP tools.

No, it is not feasible. Real sentences have structure and meaning dependencies that go well beyond what can be captured in ngrams.
I suppose you're thinking of generating a random structure by expanding your CFG, then using ngrams to select among the possible vocabulary choices. It's a pretty simple thing to code: Chop off your grammar at the part-of-speech level, generate a "sentence" with your CFG as a string of POS tags, and use the ngrams to fill them out one by one.
To work with Google's entire 5-gram collection you'll need a lot of disk space and a huge amount of RAM or some clever programming, so I recommend you experiment with one of NLTK's tagged corpora (e.g., the Brown corpus with the "universal" tagset). Starting from any text, it is not hard to collect its ngrams, write a random text generator, and confirm that it produces semi-cohesive but undeniably incoherent (and still mostly ungrammatical) nonsense.
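A minimal sketch of that experiment, under some loud assumptions: the three-sentence tagged corpus is made up (in practice it would come from something like the Brown corpus), the fixed tag sequence stands in for a string of POS tags produced by expanding the CFG, and the helper names build_tables and fill are hypothetical.

```python
import random
from collections import defaultdict

# Toy tagged corpus (made up for illustration); a real experiment would
# read tagged sentences from a corpus such as Brown instead.
TAGGED = [
    [("the", "DET"), ("dog", "NOUN"), ("chased", "VERB"), ("the", "DET"), ("cat", "NOUN")],
    [("a", "DET"), ("cat", "NOUN"), ("saw", "VERB"), ("a", "DET"), ("bird", "NOUN")],
    [("the", "DET"), ("bird", "NOUN"), ("ate", "VERB"), ("the", "DET"), ("seed", "NOUN")],
]

def build_tables(tagged_sentences):
    """Collect words per tag and, for each word, the (word, tag) pairs that follow it."""
    by_tag = defaultdict(list)
    bigrams = defaultdict(list)
    for sent in tagged_sentences:
        prev = "<s>"  # sentence-start marker
        for word, tag in sent:
            by_tag[tag].append(word)
            bigrams[prev].append((word, tag))
            prev = word
    return by_tag, bigrams

def fill(pos_sequence, by_tag, bigrams, rng):
    """Fill a POS sequence left to right, preferring words seen after the previous word."""
    words, prev = [], "<s>"
    for tag in pos_sequence:
        candidates = [w for w, t in bigrams.get(prev, []) if t == tag]
        if not candidates:
            # Back off to any word with the right tag when no bigram fits.
            candidates = by_tag[tag]
        word = rng.choice(candidates)
        words.append(word)
        prev = word
    return words

by_tag, bigrams = build_tables(TAGGED)
# Assumed output of the CFG, cut off at the part-of-speech level.
sentence = fill(["DET", "NOUN", "VERB", "DET", "NOUN"], by_tag, bigrams, random.Random(0))
print(" ".join(sentence))
```

Each word is chosen only from what followed the previous word in the corpus, so the output is locally plausible, but nothing ties the subject, verb and object together, which is the limitation described above.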

Related

Keyword extraction and keyword-based text classification

I am currently working on a project that requires keyword extraction, or more precisely keyword-based text classification. The dataset contains three columns: text, keywords and cc terms. I need to extract keywords from the text and then classify the text based on those keywords. Each row in the dataset has its own keywords, and I want to extract similar kinds of keywords. I want to train a model on the text and keyword columns so that it can extract keywords from unseen text. Please help.
Keyword extraction is typically done with TF-IDF scores, simply by setting a score threshold. When training a classifier, however, it does not make much sense to cut off the keywords at a certain threshold: knowing that something is unlikely to be a keyword might also be a valuable piece of information for the classifier.
The simplest way to get the TF-IDF scores for particular words is to use TfidfVectorizer in scikit-learn, which takes care of the laborious text preprocessing steps (tokenization and, if requested, stop-word removal).
You can probably achieve better results by fine-tuning BERT for your classification task (but of course at the expense of much higher computational costs).
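To make the TF-IDF thresholding idea concrete, here is a dependency-free sketch. It uses the plain textbook formula (term frequency times log of inverse document frequency), which is not the exact formula scikit-learn applies (its vectorizer smooths the idf and normalizes each row by default). The three documents, the threshold of 0.1 and the helper names are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: score} dict per document (plain tf * idf, no smoothing)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

def keywords(score, threshold):
    """Terms whose score reaches the threshold, highest score first."""
    return sorted((t for t, s in score.items() if s >= threshold), key=lambda t: -score[t])

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell sharply",
]
scores = tfidf(docs)
print(keywords(scores[2], threshold=0.1))
```

A word that occurs in every document ("the" here) gets a score of zero. For a classifier, the full score dictionaries can be used as features instead of the thresholded lists, in line with the point about not discarding low-scoring words.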

How can I consider word dependence along with the semantic information in information retrieval?

I am working on a project in which text retrieval plays an important part. There is a reference collection (D), and users can enter queries (Q). As in a search engine, the goal is to retrieve the documents most related to each query.
I used pre-trained word embeddings to extract semantic knowledge about each word in a text, then aggregated the word vectors (using the mean or the sum) to represent each text as a single vector. Next, I indexed the document vectors and retrieved those most similar to the query vector. However, the results were not acceptable. I also tested traditional approaches such as bag-of-words (BOW). While these approaches work very well in some situations, they do not consider semantic and syntactic information, which makes them a poor fit for some queries.
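For reference, the averaging baseline described above looks roughly like this. The three-dimensional vectors in EMB are invented for illustration (real ones would be loaded from a pre-trained embedding file), and embed, cosine and rank are hypothetical helper names.

```python
import math

# Invented 3-dimensional embeddings; real vectors would come from a pre-trained model.
EMB = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "pet": [0.7, 0.3, 0.1],
    "stock": [0.0, 0.1, 0.9],
    "market": [0.1, 0.0, 0.8],
}

def embed(text):
    """Mean of the vectors of the known words in the text, or None if there are none."""
    vecs = [EMB[w] for w in text.lower().split() if w in EMB]
    if not vecs:
        return None
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Documents sorted by cosine similarity to the query, best first."""
    q = embed(query)
    scored = []
    for doc in docs:
        d = embed(doc)
        scored.append((cosine(q, d) if q and d else 0.0, doc))
    return sorted(scored, reverse=True)

docs = ["dog cat", "stock market"]
ranking = rank("pet", docs)
print(ranking[0][1])
```

Because the mean ignores word order and which words appear together, "dog cat" and "cat dog" get identical vectors; that is the information this representation throws away.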
Based on my investigation, considering word dependence (for example, words co-occurring in the same sentence) along with the semantic information (obtained from the pre-trained word embeddings) can be very useful. However, I do not know how to combine them in a way that is applicable to IR.
It should be noted that:
I'm not looking for paragraph2vec or doc2vec; those require training on a large corpus, and I don't have one. Instead, I want to use existing word embeddings.
I'm not looking for a re-ranking technique such as a learning-to-rank approach. Instead, I'm looking for a way to take advantage of both syntactic and semantic information in the representation step, i.e. when mapping the text or query to a feature vector.
Any help would be appreciated.

How to implement a simple Markov model to assign authors to anonymous texts?

Let's say I have harvested the posts from a forum. Then I removed all the usernames and signatures, so that now I only know what post was in which thread but not who posted what, or even how many authors there are (though clearly the number of authors cannot be greater than the number of texts).
I want to use a Markov model (look at which words/letters follow which ones) to figure out how many people used this forum, and which posts were written by the same person. To vastly simplify, perhaps one person tends to say "he were" while another person tends to say "he was" - I'm talking about a model that works with this sort of basic logic.
Note how there are some obvious issues with the data: Some posts may be very short (one word answers). They may be repetitive (quoting each other or using popular forum catchphrases). The individual texts are not very long.
One could suspect that it is rare for a person to make consecutive posts, or that people are more likely to post in threads they have already posted in. Exploiting this is optional.
Let's assume the posts are plaintexts and have no markup, and that everyone on the forum uses English.
I would like to obtain a distance matrix for all texts T_i such that D_ij is the probability that text T_i and text T_j are written by the same author, based on word/character pattern. I am planning to use this distance matrix to cluster the texts, and ask questions such as "What other texts were authored by the person who authored this text?"
How would I actually go about implementing this? Do I need a hidden MM? If so, what is the hidden state? I understand how to train an MM on a text and then generate a similar text (e.g. a generated Alice in Wonderland), but after I train a frequency tree, how do I check a text against it to get the probability that it was generated by that tree? Should I look at letters or words when building the tree?
My advice is to put aside the business about the distance matrix and think first about a probabilistic model P(text | author). Constructing that model is the hard part of your work; once you have it, you can compute P(author | text) via Bayes' rule. Don't put the cart before the horse: the model might or might not involve distance metrics or matrices of various kinds, but don't worry about that, just let it fall out of the model.
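One possible shape for such a model, as a sketch only: a character-bigram Markov model with add-one smoothing gives log P(text | author), and Bayes' rule turns that into P(author | text). This assumes, unlike the situation in the question, that a few posts with known authors are available to train on; the sample posts, the uniform priors and the class name CharBigramModel are all made up for illustration.

```python
import math
from collections import Counter

class CharBigramModel:
    """Character-bigram Markov model with add-one smoothing."""

    def __init__(self, texts, alphabet):
        self.alphabet = alphabet
        self.pair_counts = Counter()     # (previous char, char) -> count
        self.context_counts = Counter()  # previous char -> count
        for text in texts:
            padded = "^" + text          # "^" marks the start of a text
            for prev, cur in zip(padded, padded[1:]):
                self.pair_counts[(prev, cur)] += 1
                self.context_counts[prev] += 1

    def log_prob(self, text):
        """log P(text | this author's model)."""
        padded = "^" + text
        vocab = len(self.alphabet)
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            num = self.pair_counts[(prev, cur)] + 1
            den = self.context_counts[prev] + vocab
            total += math.log(num / den)
        return total

def posterior(text, models, priors):
    """P(author | text) via Bayes' rule, computed in log space for stability."""
    log_joint = {a: math.log(priors[a]) + m.log_prob(text) for a, m in models.items()}
    top = max(log_joint.values())
    unnorm = {a: math.exp(v - top) for a, v in log_joint.items()}
    z = sum(unnorm.values())
    return {a: v / z for a, v in unnorm.items()}

# Made-up training posts with known authors (an assumption of this sketch).
samples = {
    "alice": ["he were late again", "we was going home", "he were not there"],
    "bob": ["he was late again", "we were going home", "he was not there"],
}
alphabet = set("".join(t for texts in samples.values() for t in texts))
models = {a: CharBigramModel(texts, alphabet) for a, texts in samples.items()}
priors = {"alice": 0.5, "bob": 0.5}
post = posterior("he were home", models, priors)
print(post)
```

The log_prob method is also the answer to "how do I check a text against the trained model": walk through the text and add up the log-probabilities of each transition. The same structure works with words instead of characters, at the cost of needing far more data per author.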
You might want to take a look at Hierarchical Clustering. With this algorithm you can define your own distance function and it will give you clusters based on it. If you define a good distance function, the resulting clusters will correspond to one author each.
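As a rough illustration of clustering with a self-defined distance function, the sketch below performs single-linkage agglomerative clustering cut at a fixed distance threshold (it returns the clusters at that cut, not the full hierarchy). The Jaccard word-overlap distance, the threshold of 0.7 and the four posts are assumptions chosen for the example, not a recommendation for real data.

```python
def single_linkage(items, distance, threshold):
    """Group items so that any two items closer than threshold end up together."""
    parent = list(range(len(items)))  # union-find forest over item indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if distance(items[i], items[j]) < threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, item in enumerate(items):
        clusters.setdefault(find(i), []).append(item)
    return list(clusters.values())

def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over the sets of words in the two posts."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return 1.0 - len(wa & wb) / len(union) if union else 0.0

posts = [
    "he were late again",
    "he were not there",
    "she was late again today",
    "she was not there today",
]
clusters = single_linkage(posts, jaccard_distance, threshold=0.7)
print(clusters)
```

Any function of two posts can be passed as the distance, including one derived from a probabilistic model; the quality of the clusters depends almost entirely on that choice.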
This is probably quite hard to do though and you might need a lot of posts to really get an interesting result. Nevertheless, I wish you good luck!
You mention a Markov model in your question. Markov models are about sequences of tokens and how one token depends on previous tokens and possibly internal state.
If you want to use probabilistic methods you might want to use a different kind of statistical model that is not so much based on sequences but on bags or sets of words or features.
For example you could use the K most frequent words of the text and create all M-grams of tokens in each post where the nonfrequent words are replaced by empty placeholders. This could allow you to learn phrases commonly used by different authors.
In addition you could use single words as features, so that a post gets as features all words in the post (here you can ignore frequent words and use only rare words - the same authors might be interested in the same topics or use the same words or make the same spelling mistakes).
Additionally you can try to capture the style of authors in features: how many paragraphs, how long sentences, how many commas per sentence, does the author use capitalization or not, are numbers spelled out or not, etc ... these are all features that are not sequences as you would use in a HMM but features assigned to each post.
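A few of the style features listed above can be computed with nothing but regular expressions. The sketch below is deliberately crude (for example, it splits sentences on ".", "!" and "?" only, so a decimal number would be split too), and the feature names and the function style_features are hypothetical.

```python
import re

def style_features(post):
    """Per-post style features; none of them depend on word order."""
    sentences = [s for s in re.split(r"[.!?]+", post) if s.strip()]
    words = re.findall(r"[A-Za-z']+", post)
    paragraphs = [p for p in post.split("\n\n") if p.strip()]
    n_sent = max(len(sentences), 1)  # avoid division by zero on empty posts
    return {
        "paragraphs": len(paragraphs),
        "words_per_sentence": len(words) / n_sent,
        "commas_per_sentence": post.count(",") / n_sent,
        # Fraction of sentences whose first character is an uppercase letter.
        "starts_capitalized": sum(s.strip()[0].isupper() for s in sentences) / n_sent,
        # Numbers written with digits (as opposed to spelled out).
        "digit_numbers": len(re.findall(r"\d+", post)),
    }

features = style_features("Well, I saw 3 cats. they were small, grey, and loud.")
print(features)
```

Each post becomes a fixed-length feature vector, which can be combined with word or M-gram features and fed to any classifier or clustering method.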
In summary: even though sequences are certainly important to catch phrases you definitely want more than just a sequence model.

Train Tesseract for specific words - possible?

I want to use Tesseract to extract about 10-20 keywords from a document. The document will contain all English characters/words. What I am interested in is something like "Age: 23". Here Age is the keyword I am interested in and want to extract the 23 (the value for that) as well.
The first approach that comes to mind is to extract the whole page as text and then look for the keywords in the recognized text. But in terms of training Tesseract, is there a better approach, given that I know the keywords, that might result in better accuracy?
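The first approach, searching the recognized text for known keywords, might look like the sketch below. It assumes the OCR output is already available as a string (ocr_text here is a made-up stand-in for whatever Tesseract returned) and that each field sits on one line in a "Keyword: value" or "Keyword - value" shape; extract_fields is a hypothetical helper.

```python
import re

def extract_fields(ocr_text, keywords):
    """Return {keyword: value} for lines shaped like 'Keyword: value'."""
    found = {}
    for key in keywords:
        # Keyword, optional spaces, a ':' or '-' separator, then the rest of the line.
        pattern = re.compile(rf"\b{re.escape(key)}\s*[:\-]\s*(.+)", re.IGNORECASE)
        match = pattern.search(ocr_text)
        if match:
            found[key] = match.group(1).strip()
    return found

# Stand-in for text recognized by the OCR engine.
ocr_text = "Name: Jane Doe\nAge : 23\nCity - Springfield\n"
fields = extract_fields(ocr_text, ["Age", "City", "Phone"])
print(fields)
```

This does nothing to improve recognition itself; if the OCR misreads "Age" as "Aqe", the pattern will miss it, so some fuzzy matching on the keywords may be needed on top.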
I am more or less aware of the limitations of Tesseract OCR, and I am trying to get the most out of it within those limitations. Thanks for all your expert advice.
Try the bazaar config in Tesseract, which enables matching against user-supplied words and patterns.

Methods for automated synonym detection

I am currently working on a neural network based approach to short document classification, and since the documents I am working with are usually around ten words long, the standard statistical document classification methods are of limited use. Because of this, I am attempting to implement some form of automated synonym detection for the matches provided in the training data. More specifically, my question is about resolving a situation like the following:
Say I have the classifications "Involving Food" and "Involving Spheres", and a data set as follows:
"Eating Apples"(Food); "Eating Marbles"(Spheres); "Eating Oranges"(Food, Spheres);
"Throwing Baseballs"(Spheres); "Throwing Apples"(Food); "Throwing Balls"(Spheres);
"Spinning Apples"(Food); "Spinning Baseballs";
I am looking for an incremental method that would move towards the following linkages:
Eating --> Food
Apples --> Food
Marbles --> Spheres
Oranges --> Food, Spheres
Throwing --> Spheres
Baseballs --> Spheres
Balls --> Spheres
Spinning --> Neutral
Involving --> Neutral
I do realize that in this specific case these might be slightly suspect matches, but it illustrates the problem I am having. My first thought was to increment a word's score each time it appears opposite the words in a category, but in that case I would end up incidentally linking everything to the word "Involving". I then thought that I would simply decrement a word's score for appearing in conjunction with multiple synonyms, or with non-synonyms, but then I would lose the link between "Eating" and "Food". Does anyone have any idea how I could put together an algorithm that would move me in the directions indicated above?
There is an unsupervised bootstrapping approach to this that was once explained to me.
There are different ways of applying this approach, and variants, but here's a simplified version.
Concept:
Start by assuming that if two words are synonyms, then in your corpus they will appear in similar settings. (eating grapes, eating sandwich, etc.)
(In this variant I will use co-occurrence as the setting).
Bootstrapping Algorithm:
We have two lists,
one list will contain the words that co-occur with food items
one list will contain the words that are food items
Supervised Part
Start by seeding one of the lists, for instance I might write the word Apple on the food items list.
Now let the computer take over.
Unsupervised Parts
It will first find all words in the corpus that appear just before Apple, and sort them by how often they occur.
Take the top two (or however many you want) and add them into the co-occur with food items list. For example, perhaps "eating" and "Delicious" are the top two.
Now use that list to find the next two top food words by ranking the words that appear to the right of each word in the list.
Continue this process expanding each list until you are happy with the results.
Along the way, you may need to manually remove entries from the lists that are clearly wrong.
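The two-list procedure above can be sketched as follows. The six-sentence corpus is made up, words are lowercased (so the seed is "apple" rather than "Apple"), and the function name bootstrap and its parameters rounds and top_n are assumptions of this sketch rather than part of the system described.

```python
from collections import Counter

def bootstrap(sentences, seeds, rounds=2, top_n=2):
    """Alternately grow a list of items and a list of words that precede them."""
    items = set(seeds)  # e.g. food words, seeded by hand
    contexts = set()    # words seen immediately before an item
    # All (previous word, word) pairs in the corpus.
    pairs = []
    for sent in sentences:
        tokens = sent.lower().split()
        pairs.extend(zip(tokens, tokens[1:]))
    for _ in range(rounds):
        # Most frequent new words appearing just before a known item.
        ctx_counts = Counter(p for p, c in pairs if c in items and p not in contexts)
        contexts.update(w for w, n in ctx_counts.most_common(top_n))
        # Most frequent new words appearing just after a known context word.
        item_counts = Counter(c for p, c in pairs if p in contexts and c not in items)
        items.update(w for w, n in item_counts.most_common(top_n))
    return items, contexts

corpus = [
    "she was eating apple",
    "he was eating grapes",
    "they were eating sandwich",
    "a delicious apple",
    "a delicious sandwich",
    "we were throwing baseballs",
]
items, contexts = bootstrap(corpus, seeds=["apple"])
print(sorted(items), sorted(contexts))
```

On this toy corpus "baseballs" never enters the food list because "throwing" never precedes a known food word. On real text, frequent function words tend to creep into the context list, which is where the manual clean-up comes in.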
Variants
This procedure can be made quite effective if you take into account the grammatical setting of the keywords.
Subj ate NounPhrase
NounPhrase are/is Moldy
The workers harvested the Apples.
subj verb Apples
That might imply harvested is an important verb for distinguishing foods.
Then look for other occurrences of subj harvested nounPhrase
You can expand this process to move words into categories, instead of a single category at each step.
My Source
This approach was used in a system developed at the University of Utah a few years back which was successful at compiling a decent list of weapon words, victim words, and place words by just looking at news articles.
An interesting approach, and had good results.
Not a neural network approach, but an intriguing methodology.
Edit:
The system at the University of Utah was called AutoSlog-TS. A short slide about it can be seen here, towards the end of the presentation, and a paper about it is linked here.
You could try LDA which is unsupervised. There is a supervised version of LDA but I can't remember the name! Stanford parser will have the algorithm which you can play around with. I understand it's not the NN approach you are looking for. But if you are just looking to group information together LDA would seem appropriate, especially if you are looking for 'topics'
The code here (http://ronan.collobert.com/senna/) implements a neural network to perform a variety of NLP tasks. The page also links to a paper that describes one of the most successful approaches so far of applying convolutional neural nets to NLP tasks.
It is possible to modify their code to use the trained networks that they provide to classify sentences, but this may take more work than you were hoping for, and it can be tricky to correctly train neural networks.
I had a lot of success using a similar technique to classify biological sequences, but, in contrast to English language sentences, my sequences had only 20 possible symbols per position rather than 50-100k.
One interesting feature of their network that may be useful to you is their word embeddings. Word embeddings map individual words (each can be considered an indicator vector of length 100k) to real valued vectors of length 50. Euclidean distance between the embedded vectors should reflect semantic distance between words, so this could help you detect synonyms.
For a simpler approach WordNet (http://wordnet.princeton.edu/) provides lists of synonyms, but I have never used this myself.
I'm not sure if I misunderstand your question. Do you require the system to be able to reason based on your input data alone, or would it be acceptable to refer to an external dictionary?
If it is acceptable, I would recommend you take a look at http://wordnet.princeton.edu/ which is a database of English word relationships. (It also exists for a few other languages.) These relationships include synonyms, antonyms, hyperonyms (which is what you really seem to be looking for, rather than synonyms), hyponyms, etc.
The hyperonym / hyponym relationship links more generic terms to more specific ones. The words "banana" and "orange" are hyponyms of "fruit"; "fruit" is a hyperonym of both (see http://en.wikipedia.org/wiki/Hyponymy). Of course, "orange" is ambiguous, and is also a hyponym of "color".
You asked for a method, but I can only point you to data. Even if this turns out to be useful, you will obviously need quite a bit of work to use it for your particular application. For one thing, how do you know when you have reached a suitable level of abstraction? Unless your input is heavily normalized, you will have a mix of generic and specific terms. Do you stop at "citrus", "fruit", "plant", "animate", "concrete", or "noun"? (Sorry, just made up this particular hierarchy.) Still, hope this helps.
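To illustrate the level-of-abstraction question, here is a sketch over a hand-made hierarchy. The HYPERNYM table is invented for this example and is not taken from WordNet, and it gives every word a single parent, so it ignores the ambiguity of words like "orange".

```python
# Hand-made toy hierarchy (invented, not WordNet data): word -> its hyperonym.
HYPERNYM = {
    "orange": "citrus",
    "lemon": "citrus",
    "citrus": "fruit",
    "banana": "fruit",
    "apple": "fruit",
    "fruit": "food",
    "bread": "food",
}

def ancestors(word):
    """The word itself followed by its increasingly generic hyperonyms."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def lowest_common_hypernym(a, b):
    """Most specific term that both words fall under, or None if there is none."""
    seen = set(ancestors(a))
    for term in ancestors(b):
        if term in seen:
            return term
    return None

print(lowest_common_hypernym("orange", "lemon"))
print(lowest_common_hypernym("orange", "banana"))
print(lowest_common_hypernym("orange", "bread"))
```

One way to pick a stopping level is to climb only as far as the lowest term shared by the words already known to belong to a category, rather than going all the way to the root.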