Find if a phrase is 'generally rare' in English - nltk

I want to extract rare words from text: not rare in that text, but generally rare in English.
Is there an NLTK module that uses a large corpus that can answer such a query?

As far as I know, the only available corpus is the Dutch one (Alpino); I think you should build your own.
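If you do build your own, a minimal sketch of one way to do it with corpora that ship with NLTK, assuming the Brown corpus is an acceptable proxy for "general English" and the rarity threshold is arbitrary:

```python
import nltk
from nltk.corpus import brown

# Assumption: the Brown corpus is a rough proxy for "general English" frequency.
nltk.download('brown', quiet=True)
freq = nltk.FreqDist(w.lower() for w in brown.words())
total = freq.N()

def is_rare(word, threshold=1e-5):
    """Call a word 'generally rare' if its relative frequency in the
    reference corpus falls below an (arbitrary) threshold."""
    return freq[word.lower()] / total < threshold

text = "The sesquipedalian professor gave a short talk"
print([w for w in text.split() if is_rare(w.strip('.,!?'))])
```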

Related

Understanding ConvNet Prediction on Text Classification

I'm trying to debug a model that uses 1D convolutions to classify text that was labeled by humans as being "appropriate" vs "not appropriate" to be posted on some website. Looking at false positives (wrongly predicted "appropriate"), I see that the text has mostly neutral/positive sounding words, but the idea conveyed is bad (ex: talking about "capping population"). To address a case like this, I can think of ways to help the model learn that the subject of capping population (in this example) should not be classified as "appropriate" for this particular task.
The problem I'm having is understanding what caused the model to predict "not appropriate" for messages that are in fact appropriate. For example, the following message should be considered "appropriate":
"The blame lies with the individual who commits the crime."
The model thinks that's not appropriate, but according to the labeling criteria of the dataset, that's a valid message.
Question
Given a model with an embedding layer for each word, followed by several 1D convs + dense layer, what are some techniques that can help me understand what is causing the model to classify that message as such, and what are potential ways to help the model learn that it's OK?
Update
It turns out that if I take the example phrase above, replace one word at a time, and see how the model classifies the resulting phrase, it classifies the phrase as "appropriate" when I replace the word "lies" with just about any other "positive" or "neutral" word. So it seems like the model learned that "lies" is a really, really bad word. The question is: how do I create features, or otherwise help the model generalize beyond that?
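For reference, the probe was roughly the following (a sketch; `model` and `encode` are stand-ins for the real classifier and the text-to-padded-ids preprocessing):

```python
# Sketch of the one-word-at-a-time substitution probe described above.
# `model` and `encode` are hypothetical: encode(text) should return the padded
# id array the model expects, and model.predict should return P("appropriate").
NEUTRAL_SUBS = ["rests", "sits", "remains", "stays"]

def probe(model, encode, sentence):
    """Swap each word for a neutral filler and report how the score moves."""
    base = float(model.predict(encode(sentence)).ravel()[0])
    words = sentence.split()
    for i, original in enumerate(words):
        for sub in NEUTRAL_SUBS:
            perturbed = " ".join(words[:i] + [sub] + words[i + 1:])
            score = float(model.predict(encode(perturbed)).ravel()[0])
            print(f"{original!r} -> {sub!r}: {base:.3f} -> {score:.3f}")

# probe(model, encode, "The blame lies with the individual who commits the crime.")
```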
Maybe in the dataset used to train the model, most of the texts containing the word "lies" (and "related" expressions) were labeled as "not appropriate" by humans, and there weren't enough examples of "appropriate" usages (e.g. "lies are bad", "avoid spreading misinformation").
It could also be the case that many of the examples were related to the meaning "false statement" and not as many were related to other meanings.
These are some reasons I can think of for it to learn that texts containing "lies" are more likely to be "not appropriate".
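One quick way to check that hypothesis, assuming the training set is available as a hypothetical list of (text, label) pairs:

```python
from collections import Counter

def label_distribution(train_data, word):
    """Count labels among training texts containing `word`.
    `train_data` is a hypothetical list of (text, label) pairs."""
    return Counter(label for text, label in train_data
                   if word in text.lower().split())

# print(label_distribution(train_data, "lies"))
```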

How CFG and Google n-grams can be combined to generate sentences

I have a valid list of grammar rules and lexical items for generating grammatically correct, yet meaningless, phrases. I want to combine them with Google n-grams to generate only valid (meaningful) sentences. Is this feasible, and is there any paper on this? I am using NLTK and the Stanford CoreNLP tools.
No, it is not feasible. Real sentences have structure and meaning dependencies that go well beyond what can be captured in ngrams.
I suppose you're thinking of generating a random structure by expanding your CFG, then using ngrams to select among the possible vocabulary choices. It's a pretty simple thing to code: Chop off your grammar at the part-of-speech level, generate a "sentence" with your CFG as a string of POS tags, and use the ngrams to fill them out one by one.
To work with Google's entire 5-gram collection you'll need a lot of disk space and a huge amount of RAM or some clever programming, so I recommend you experiment with one of NLTK's tagged corpora (e.g., the Brown corpus with the "universal" tagset). Starting from any text, it is not hard to collect its ngrams, write a random text generator, and confirm that it produces semi-cohesive but undeniably incoherent (and still mostly ungrammatical) nonsense.
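A rough sketch of that pipeline with NLTK, where the grammar is a toy and Brown bigram counts stand in for the Google n-grams:

```python
import random
from collections import defaultdict

import nltk
from nltk import CFG
from nltk.parse.generate import generate
from nltk.corpus import brown

nltk.download('brown', quiet=True)
nltk.download('universal_tagset', quiet=True)

# A toy grammar whose terminals are POS tags from the universal tagset.
grammar = CFG.fromstring("""
S -> NP VP
NP -> DET NOUN | DET ADJ NOUN
VP -> VERB NP
DET -> 'DET'
ADJ -> 'ADJ'
NOUN -> 'NOUN'
VERB -> 'VERB'
""")

tagged = list(brown.tagged_words(tagset='universal'))

# Words grouped by tag, and by (previous word, tag) for bigram-based filling.
by_tag = defaultdict(list)
by_prev = defaultdict(list)
for (w1, _), (w2, t2) in zip(tagged, tagged[1:]):
    by_prev[(w1.lower(), t2)].append(w2.lower())
for w, t in tagged:
    by_tag[t].append(w.lower())

def fill(tag_sequence):
    """Fill a POS-tag skeleton with words, preferring words seen after the
    previous word with that tag, falling back to any word with that tag."""
    words, prev = [], '.'
    for tag in tag_sequence:
        pool = by_prev.get((prev, tag)) or by_tag[tag]
        prev = random.choice(pool)
        words.append(prev)
    return ' '.join(words)

# Expand the CFG into POS-tag skeletons, then fill them from the ngram counts.
for tags in generate(grammar, n=3):
    print(fill(tags))
```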

How to implement a simple Markov model to assign authors to anonymous texts?

Let's say I have harvested the posts from a forum. Then I removed all the usernames and signatures, so that now I only know what post was in which thread but not who posted what, or even how many authors there are (though clearly the number of authors cannot be greater than the number of texts).
I want to use a Markov model (look at which words/letters follow which ones) to figure out how many people used this forum, and which posts were written by the same person. To vastly simplify: perhaps one person tends to say "he were" while another tends to say "he was" - I'm talking about a model that works with this sort of basic logic.
Note that there are some obvious issues with the data: some posts may be very short (one-word answers); they may be repetitive (quoting each other or using popular forum catchphrases); and the individual texts are not very long.
One could suspect that it is rare for a person to make consecutive posts, or that people are more likely to post in threads they have already posted in. Exploiting this is optional.
Let's assume the posts are plaintexts and have no markup, and that everyone on the forum uses English.
I would like to obtain a distance matrix for all texts T_i such that D_ij is the probability that text T_i and text T_j are written by the same author, based on word/character pattern. I am planning to use this distance matrix to cluster the texts, and ask questions such as "What other texts were authored by the person who authored this text?"
How would I actually go about implementing this? Do I need a hidden MM? If so, what is the hidden state? I understand how to train an MM on a text and then generate a similar text (e.g. generating something in the style of Alice in Wonderland), but after I train a frequency tree, how do I check a text against it to get the probability that it was generated by that tree? Should I look at letters or words when building the tree?
My advice is to put aside the business about the distance matrix and think first about a probabilistic model P(text | author). Constructing that model is the hard part of your work; once you have it, you can compute P(author | text) via Bayes' rule. Don't put the cart before the horse: the model might or might not involve distance metrics or matrices of various kinds, but don't worry about that; just let it fall out of the model.
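As a toy illustration of what such a model could look like (everything here is illustrative, not a recommendation of character bigrams specifically): a per-author character-bigram Markov model with add-one smoothing gives log P(text | author), and a uniform prior plus Bayes' rule turns that into P(author | text).

```python
import math
from collections import Counter

class CharBigramModel:
    """Character-bigram Markov model with add-one smoothing,
    giving log P(text | author)."""
    def __init__(self, texts):
        self.counts = Counter()
        self.context = Counter()
        self.vocab = set()
        for t in texts:
            t = '^' + t.lower()
            for a, b in zip(t, t[1:]):
                self.counts[(a, b)] += 1
                self.context[a] += 1
                self.vocab.update((a, b))

    def log_prob(self, text):
        text = '^' + text.lower()
        V = len(self.vocab) + 1
        return sum(
            math.log((self.counts[(a, b)] + 1) / (self.context[a] + V))
            for a, b in zip(text, text[1:])
        )

# Toy usage: two "authors", uniform prior, Bayes' rule for P(author | text).
models = {
    'A': CharBigramModel(["he were going home", "he were late again"]),
    'B': CharBigramModel(["he was going home", "he was late again"]),
}
text = "he were tired"
scores = {a: m.log_prob(text) for a, m in models.items()}
norm = math.log(sum(math.exp(s) for s in scores.values()))
posterior = {a: math.exp(s - norm) for a, s in scores.items()}
print(posterior)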
You might want to take a look at Hierarchical Clustering. With this algorithm you can define your own distance function and it will give you clusters based on it. If you define a good distance function, the resulting clusters will correspond to one author each.
This is probably quite hard to do though and you might need a lot of posts to really get an interesting result. Nevertheless, I wish you good luck!
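For illustration, here is a minimal sketch with SciPy/scikit-learn, using character 3-gram profiles and cosine distance as stand-ins for whatever distance function you end up defining:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

posts = [
    "He were at the market yesterday.",
    "He were late again, as usual.",
    "He was at home all day.",
    "He was quite happy about it.",
]

# Character 3-gram profiles as a crude proxy for writing style.
X = TfidfVectorizer(analyzer='char', ngram_range=(3, 3)).fit_transform(posts).toarray()

# Custom distance (cosine) -> condensed distance matrix -> average-linkage clustering.
D = pdist(X, metric='cosine')
clusters = fcluster(linkage(D, method='average'), t=2, criterion='maxclust')
print(clusters)   # posts in the same cluster are tentatively "same author"
```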
You mention a Markov model in your question. Markov models are about sequences of tokens and how one token depends on previous tokens and possibly internal state.
If you want to use probabilistic methods you might want to use a different kind of statistical model that is not so much based on sequences but on bags or sets of words or features.
For example, you could use the K most frequent words of the text and create all M-grams of tokens in each post, where the non-frequent words are replaced by an empty placeholder. This could allow you to learn phrases commonly used by different authors.
In addition, you could use single words as features, so that a post gets as features all words in the post (here you can ignore frequent words and use only rare words - the same author might be interested in the same topics, use the same words, or make the same spelling mistakes).
Additionally, you can try to capture the style of authors in features: how many paragraphs there are, how long the sentences are, how many commas per sentence, whether the author uses capitalization or not, whether numbers are spelled out or not, and so on. These are all features that are not sequences, as you would use in an HMM, but features assigned to each post.
In summary: even though sequences are certainly important for catching phrases, you definitely want more than just a sequence model.
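A sketch of the non-sequence style features mentioned above (the exact features and their definitions are just examples):

```python
import re

def style_features(post):
    """Rough stylometric features of the kind described above
    (a sketch; tune the feature set to your data)."""
    sentences = [s for s in re.split(r'[.!?]+', post) if s.strip()]
    words = post.split()
    return {
        'n_paragraphs': post.count('\n\n') + 1,
        'avg_sentence_len': len(words) / max(len(sentences), 1),
        'commas_per_sentence': post.count(',') / max(len(sentences), 1),
        'capitalized_ratio': sum(w[0].isupper() for w in words) / max(len(words), 1),
        'digit_tokens': sum(w.isdigit() for w in words),
    }

print(style_features("He were late again, as usual. Blame lies with the trains."))
```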

Train Tesseract for specific words - possible?

I want to use Tesseract to extract about 10-20 keywords from a document. The document contains only English characters/words. What I am interested in is something like "Age: 23": here "Age" is the keyword I am interested in, and I want to extract the "23" (its value) as well.
The first approach that comes to mind is to extract the whole page as text and then look for the keywords in the recognized text. But in terms of training Tesseract, is there a better approach, given that I know the keywords, that might result in better accuracy?
I am more or less aware of the limitations of Tesseract OCR and am trying to maximize accuracy within those limitations. Thanks for all your expert advice.
Try Tesseract's "bazaar" matching-pattern config, which loads user-words and user-patterns files so you can bias recognition toward your known keywords and value patterns.
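If you go with the extract-then-search approach from the question, a minimal sketch using pytesseract (the keyword list and regex are placeholders):

```python
import re
import pytesseract
from PIL import Image

KEYWORDS = ["Age", "Name", "Date"]   # the 10-20 keywords of interest (placeholders)

def extract_values(image_path):
    """OCR the whole page, then pull out 'Keyword: value' pairs with a regex."""
    text = pytesseract.image_to_string(Image.open(image_path))
    pattern = re.compile(rf"({'|'.join(KEYWORDS)})\s*[:\-]\s*(\S+)", re.IGNORECASE)
    return {k.title(): v for k, v in pattern.findall(text)}

# print(extract_values("form_page.png"))
```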

Best practices for searching for alternate forms of a word with Lucene

I have a site which is searchable using Lucene. I've noticed from logs that users sometimes don't find what they're looking for because they enter a singular term, but only the plural version of that term is used on the site. I would like the search to find uses of other forms of a word as well. This is a problem that I'm sure has been solved many times over, so what are the best practices for this?
Please note: this site only has English content.
Some approaches I've thought of:
Look up the word in some kind of thesaurus file to determine alternate forms of a given word.
Some examples:
Searches for "car", also add "cars" to the query.
Searches for "carry", also add "carries" and "carried" to the query.
Searches for "small", also add "smaller" and "smallest" to the query.
Searches for "can", also add "can't", "cannot", "cans", and "canned" to the query.
And it should work in reverse (i.e. search for "carries" should add "carry" and "carried").
Drawbacks:
Doesn't work for many new technical words unless the dictionary/thesaurus is updated frequently.
I'm not sure about the performance of searching the thesaurus file.
Generate the alternate forms algorithmically, based on some heuristics (roughly sketched in code after the drawbacks below).
Some examples:
If the word ends in "s" or "es" or "ed" or "er" or "est", drop the suffix
If the word ends in "ies" or "ied" or "ier" or "iest", convert to "y"
If the word ends in "y", convert to "ies", "ied", "ier", and "iest"
Try adding "s", "es", "er" and "est" to the word.
Drawbacks:
Generates lots of non-words for most inputs.
Feels like a hack.
Looks like something you'd find on TheDailyWTF.com. :)
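For concreteness, the suffix heuristics above might look roughly like this (it over-generates non-words, as noted):

```python
def expand(word):
    """Naively expand a query term into possible alternate forms
    using the suffix heuristics listed above (over-generates, as noted)."""
    word = word.lower()
    forms = {word}
    # Strip common suffixes, converting "i..." suffixes back to "y".
    for suffix in ("iest", "ier", "ied", "ies", "est", "er", "ed", "es", "s"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            forms.add((stem + "y") if suffix.startswith("i") else stem)
    # Add common suffixes.
    if word.endswith("y"):
        forms.update(word[:-1] + s for s in ("ies", "ied", "ier", "iest"))
    forms.update(word + s for s in ("s", "es", "er", "est"))
    return forms

print(expand("carry"))
print(expand("small"))
```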
Something much more sophisticated?
I'm thinking of doing some kind of combination of the first two approaches, but I'm not sure where to find a thesaurus file (or what it's called, as "thesaurus" isn't quite right, but neither is "dictionary").
Consider including the PorterStemFilter in your analysis pipeline. Be sure to perform the same analysis on queries as is used when building the index.
I've also used the Lancaster stemming algorithm with good results. Using the PorterStemFilter as a guide, it is easy to integrate with Lucene.
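To see what these stemmers actually do to the example words from the question, here is a quick illustration using NLTK's Porter and Lancaster implementations (Python rather than Lucene code, but the same underlying algorithms, modulo implementation details):

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter, lancaster = PorterStemmer(), LancasterStemmer()
for word in ["car", "cars", "carry", "carries", "carried",
             "small", "smaller", "smallest"]:
    # Both the indexed terms and the query terms would be reduced this way.
    print(f"{word:10} porter={porter.stem(word):10} lancaster={lancaster.stem(word)}")
```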
Word stemming works OK for English; however, for languages where stemming is nearly impossible (like mine), option #1 is viable. I know of at least one such implementation for my language (Icelandic) for Lucene, and it seems to work very well.
Some of those look like pretty neat ideas. Personally, I would just add some markers to the query (query transformation) to make it fuzzy, or you can use the built-in FuzzyQuery, which uses Levenshtein edit distance and would also help with misspellings.
With fuzzy query syntax, Levenshtein distance is used as well. Consider a search for 'car': if you change the query to 'car~', it will find 'car', 'cars', and so on. There are other transformations of the query that should handle almost everything you need.
If you're working in a specialised field (I did this with horticulture) or with a language that doesn't play nicely with normal stemming methods, you could use your query logs to create a manual stemming table.
Just create a word -> stem mapping for all the mismatches you can think of or that people are searching for; then, when indexing or searching, replace any word that occurs in the table with the appropriate stem. Thanks to query caching this is a pretty cheap solution.
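A sketch of that table-based replacement, applied identically at index and query time (the table entries here are made-up examples):

```python
# Hand-built stemming table assembled from query-log mismatches (made-up entries).
STEM_TABLE = {
    "roses": "rose",
    "rosa": "rose",
    "seedlings": "seedling",
}

def normalize(tokens):
    """Apply the manual stem table; use the same function when indexing and searching."""
    return [STEM_TABLE.get(t.lower(), t.lower()) for t in tokens]

print(normalize("Roses and seedlings".split()))
```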
Stemming is a pretty standard way to address this issue. I've found that the Porter stemmer is way too aggressive for standard keyword search: it ends up conflating words that have different meanings. Try the KStemmer algorithm instead.