I am doing a project for my degree, and I have an actual client from another college. He wants me to do a bunch of topic modeling on an SQL file of paper abstracts he has given me. I have zero experience with topic modeling, but I've been using Gensim and NLTK in a Jupyter notebook for this.
What he wants right now is for me to generate 10 or more topics, record the 10 most common words overall from the LDA results, and then handle those words as follows: if a word is very frequent in every topic, remove it from the resulting word clouds entirely; if its frequency varies a lot between topics, remove it only from the topics where it is infrequent and keep it in the topics where it is relevant.
He also wants me to compare the frequency of each topic against the SQL files from other years. And he wants the topics to be given sensible names automatically.
I have topic models per year and overall, but of course the topics do not come out exactly the same in each year. My biggest concern is his first request, the word-removal process. Is any of this possible? I need help figuring out where to look, since Google isn't giving me what I want, probably because I'm searching with the wrong terms.
Thank you!
Show some of the code you use so we can give you more useful tips. Also, use the nlp tag; the tags you used are quite specific and not followed by many people, so your question might be hard for the relevant users to find.
By the whole word-removal thing, do you mean stop words too, or did you already remove those? Stop words are very common words ("the", "it", "me", etc.) that often rank high in most-frequent-word lists but do not carry any real meaning for finding topics.
First, remove the stop words to make the most-common-words list more useful.
Then, as he requested, look at which of the (more common) words are common in ALL the topics (for abstracts I would expect words like hypothesis, research, paper, results, etc., i.e. words that are abstract-specific but not useful for telling topics apart) and remove those. For this analysis, as well as for the initial LDA, it probably makes sense to use the data from all years together, so the model has plenty of data in which to recognize patterns. But try both variations and see whether the per-year or the overall version gives you nicer results.
Once you have your global word lists per topic, go back to the original data (split up by year) and count how often the combined words from each topic occur in each year. If you look at this over the years, you will probably see trends, e.g. topics that are popular now or in the last few years but weren't relevant further back.
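A minimal sketch of that workflow with Gensim could look like the following. It is not your exact pipeline: `docs_by_year` is just a placeholder assuming your abstracts are already tokenized per year with stop words removed.

```python
# Minimal sketch (not your exact pipeline): train one LDA model on all years,
# find words that rank highly in every topic, drop them from the per-topic
# word lists, then count topic-word frequencies per year.
# `docs_by_year` is a placeholder: {year: [[token, ...], ...]}, stop words removed.
from collections import Counter
from gensim import corpora
from gensim.models import LdaModel

all_docs = [doc for docs in docs_by_year.values() for doc in docs]
dictionary = corpora.Dictionary(all_docs)
corpus = [dictionary.doc2bow(doc) for doc in all_docs]

lda = LdaModel(corpus, id2word=dictionary, num_topics=10, passes=10, random_state=42)

# Top words per topic.
top_words = {t: [w for w, _ in lda.show_topic(t, topn=30)] for t in range(lda.num_topics)}

# Words that show up in the top list of every topic ("abstract-speak") get removed globally.
common_to_all = set.intersection(*(set(ws) for ws in top_words.values()))
filtered = {t: [w for w in ws if w not in common_to_all] for t, ws in top_words.items()}

# Per-year frequency of each topic's remaining words.
for year, docs in docs_by_year.items():
    counts = Counter(w for doc in docs for w in doc)
    for t, ws in filtered.items():
        print(year, t, sum(counts[w] for w in ws))
```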
The last thing you mentioned (automatically assigning labels to topics) is actually something quite tricky, depending on how you go about it.
The "easy" way would be e.g. just use the most frequent word in each topic as label but the results will probably be underwhelming.
A more advanced approach is Topic Labeling. Or you can try an approach like modified text summarization using more powerful models.
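As a concrete (and admittedly crude) example of the "easy" way, reusing the `lda` model from the sketch above:

```python
# "Easy" labels: the one or two highest-weighted words per topic.
# Assumes the `lda` model from the Gensim sketch above.
labels = {t: "-".join(w for w, _ in lda.show_topic(t, topn=2))
          for t in range(lda.num_topics)}
print(labels)
```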
I'm trying to debug a model that uses 1D convolutions to classify text that was labeled by humans as being "appropriate" vs "not appropriate" to be posted on some website. Looking at false positives (wrongly predicted "appropriate"), I see that the text has mostly neutral/positive sounding words, but the idea conveyed is bad (ex: talking about "capping population"). To address a case like this, I can think of ways to help the model learn that the subject of capping population (in this example) should not be classified as "appropriate" for this particular task.
The problem I'm having is understanding what caused the model to predict "not appropriate" for messages that are in fact appropriate. For example, the following message should be considered "appropriate":
"The blame lies with the individual who commits the crime."
The model thinks that's not appropriate, but according to the labeling criteria of the dataset, that's a valid message.
Question
Given a model with an embedding layer for each word, followed by several 1D convs + a dense layer, what are some techniques that can help me understand what is causing the model to classify that message this way, and what are potential ways to help the model learn that it's OK?
Update
It turns out that if I take the example phrase above, replace one word at a time, and see how the model classifies the resulting phrase, it classifies the phrase as "appropriate" when I replace the word "lies" with just about any other "positive" or "neutral" word. So it seems the model learned that "lies" is a really, really bad word. The question is: how do I create a feature (or otherwise help the model) to generalize beyond that?
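Here is roughly the probe I ran, sketched out. `model` and `encode` stand for my trained classifier and its text-to-padded-ids preprocessing; both names are placeholders, not my actual code.

```python
# Sketch of the word-substitution probe described above.
# `model` is the trained classifier and `encode` turns a list of texts into
# padded id sequences; both are assumptions, not the actual code.
sentence = "The blame lies with the individual who commits the crime."
tokens = sentence.split()
substitutes = ["rests", "sits", "remains"]  # neutral stand-ins to try

base = float(model.predict(encode([sentence]))[0][0])  # assuming one sigmoid output
for i, tok in enumerate(tokens):
    for sub in substitutes:
        probe = " ".join(tokens[:i] + [sub] + tokens[i + 1:])
        score = float(model.predict(encode([probe]))[0][0])
        # A large jump when a single word is swapped points at that word.
        if abs(score - base) > 0.3:
            print(f"'{tok}' -> '{sub}': {base:.2f} -> {score:.2f}")
```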
Maybe in the dataset used to train the model, most of the texts containing the word "lies" (and related expressions) were labeled "not appropriate" by humans, and there weren't enough examples of "appropriate" usages (e.g. "lies are bad", "avoid spreading misinformation").
It could also be the case that many of the examples were related to the meaning "false statement" and not as many were related to other meanings.
These are some reasons I can think of for it to learn that texts containing "lies" are more likely to be "not appropriate".
I am trying to build a deep neural network that takes in a set of documents and predicts the category it belongs to.
Since the number of documents in each collection is not fixed, my first attempt was to map each document to a vector with doc2vec and use the average of those vectors.
The training accuracy is as high as 90%, but the testing accuracy is as low as 60%.
Is there a better way of representing a collection of documents as a fixed length vector so that the words they have in common are captured?
The description of your process so far is a bit vague and unclear – you may want to add more detail to your question.
Typically, Doc2Vec would convert each doc to a vector, not "a collection of documents".
If you did try to collapse a collection into a single vector – for example, by averaging many doc-vecs, or calculating a vector for a synthetic document with all the sub-documents' words – you might be losing valuable higher-dimensional structure.
To "predict the category" would be a typical "classification" problem, and with a bunch of documents (represented by their per-doc vectors) and known-labels, you could try various kinds of classifiers.
I suspect from your description, that you may just be collapsing a category to a single vector, then classifying new documents by checking which existing category-vector they're closest-to. That can work – it's vaguely a K-Nearest-Neighbors approach, but with every category reduced to one summary vector rather than the full set of known examples, and each classification being made by looking at just one single nearest-neighbor. That forces a simplicity on the process that may not match the "shapes" of the real categories as well as a true KNN classifier, or other classifiers, could achieve.
If accuracy on test data falls far below that observed during training, that can indicate that significant "overfitting" is occurring: the model(s) are essentially memorizing idiosyncrasies of the training data to "cheat" at answers based on arbitrary correlations, rather than learning generalizable rules. Making your model(s) smaller – such as by decreasing the dimensionality of your doc-vectors – may help in such situations, by giving the model less extra state in which to remember peculiarities of the training data. More data can also help - as the "noise" in more numerous varied examples tends to cancel itself out, rather than achieve the sort of misguided importance that can be learned in smaller datasets.
There are other ways to convert a variable-length text into a fixed-length vector, including many based on deeper learning algorithms. But, those can be even more training-data-hungry, and it seems like you may have other factors to improve before trying those in-lieu-of Doc2Vec.
Let's say I have harvested the posts from a forum. Then I removed all the usernames and signatures, so that now I only know what post was in which thread but not who posted what, or even how many authors there are (though clearly the number of authors cannot be greater than the number of texts).
I want to use a Markov model (look at which words/letters follow which ones) to figure out how many people used this forum, and which posts were written by the same person. To vastly simplify, perhaps one person tends to say "he were" while another person tends to say "he was" - I'm talking about a model that works on this sort of basic logic.
Note how there are some obvious issues with the data: Some posts may be very short (one word answers). They may be repetitive (quoting each other or using popular forum catchphrases). The individual texts are not very long.
One could suspect that it would be rare for a person to make consecutive posts, or that people are more likely to post in threads they have already posted in. Exploiting this is optional.
Let's assume the posts are plaintexts and have no markup, and that everyone on the forum uses English.
I would like to obtain a distance matrix for all texts T_i such that D_ij is the probability that text T_i and text T_j are written by the same author, based on word/character pattern. I am planning to use this distance matrix to cluster the texts, and ask questions such as "What other texts were authored by the person who authored this text?"
How would I actually go about implementing this? Do I need a hidden Markov model? If so, what is the hidden state? I understand how to train an MM on a text and then generate a similar text (e.g. generating Alice in Wonderland-style text), but after I train a frequency tree, how do I check a text against it to get the probability that it was generated by that tree? Should I look at letters or words when building the tree?
My advice is to put aside the business about the distance matrix and think first about a probabilistic model P(text | author). Constructing that model is the hard part of your work; once you have it, you can compute P(author | text) via Bayes' rule. Don't put the cart before the horse: the model might or might not involve distance metrics or matrices of various kinds, but don't worry about that, just let it fall out of the model.
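As a toy illustration of what P(text | author) could look like, here is a character-bigram Markov model with additive smoothing; real use would need far more text per author and better smoothing, and the author groupings here are made up.

```python
# Toy illustration of P(text | author) with a character-bigram Markov model.
import math
from collections import Counter, defaultdict

def train_bigram_model(texts):
    counts = defaultdict(Counter)
    for text in texts:
        for a, b in zip(text, text[1:]):
            counts[a][b] += 1
    return counts

def log_prob(text, counts, alpha=1.0, alphabet_size=128):
    # Additive smoothing so unseen bigrams don't give -inf.
    lp = 0.0
    for a, b in zip(text, text[1:]):
        total = sum(counts[a].values())
        lp += math.log((counts[a][b] + alpha) / (total + alpha * alphabet_size))
    return lp

model_a = train_bigram_model(["he were going home", "he were late again"])
model_b = train_bigram_model(["he was going home", "he was late again"])

# Compare log P(text | author); adding log P(author) gives Bayes' rule up to a constant.
query = "he were tired"
print(log_prob(query, model_a), log_prob(query, model_b))
```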
You might want to take a look at Hierarchical Clustering. With this algorithm you can define your own distance function and it will give you clusters based on it. If you define a good distance function, the resulting clusters will correspond to one author each.
This is probably quite hard to do though and you might need a lot of posts to really get an interesting result. Nevertheless, I wish you good luck!
You mention a Markov model in your question. Markov models are about sequences of tokens and how one token depends on previous tokens and possibly internal state.
If you want to use probabilistic methods you might want to use a different kind of statistical model that is not so much based on sequences but on bags or sets of words or features.
For example, you could take the K most frequent words in the data and create all M-grams of tokens in each post, replacing the non-frequent words with empty placeholders. This could allow you to learn phrases commonly used by different authors.
In addition, you could use single words as features, so that a post gets as features all the words it contains (here you can ignore frequent words and use only rare ones - the same author might be interested in the same topics, use the same words, or make the same spelling mistakes).
Additionally, you can try to capture the style of each author in features: how many paragraphs, how long the sentences are, how many commas per sentence, whether the author uses capitalization, whether numbers are spelled out, etc. These are all features that are not sequences, as you would use in an HMM, but features assigned to each post.
In summary: even though sequences are certainly important to catch phrases you definitely want more than just a sequence model.
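To make the feature idea concrete, here is a rough sketch that computes a few per-post style features, builds a pairwise distance matrix, and clusters it hierarchically; the features, posts, and thresholds are illustrative only.

```python
# Rough sketch: simple per-post style features, a pairwise distance matrix,
# and hierarchical clustering on top. Features and thresholds are illustrative.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def style_features(post):
    words = post.split()
    sentences = [s for s in post.split(".") if s.strip()]
    return [
        len(words) / max(len(sentences), 1),                       # words per sentence
        post.count(",") / max(len(sentences), 1),                  # commas per sentence
        sum(w[0].isupper() for w in words) / max(len(words), 1),   # capitalization rate
        sum(c.isdigit() for c in post) / max(len(post), 1),        # digit rate
    ]

posts = ["He was late. Very late, again.", "he were going home", "He was tired."]
X = np.array([style_features(p) for p in posts])
D = pdist(X, metric="euclidean")  # condensed pairwise distance matrix
clusters = fcluster(linkage(D, method="average"), t=2, criterion="maxclust")
print(clusters)  # posts with the same label are tentatively "same author"
```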
I have the following requirement:
I have many (say 1 million) values (names).
The user will type a search string.
I don't expect the user to spell the names correctly.
So, I want to build something like Google's "Did you mean", which will list all the possible values from my datastore. There is a similar but not identical question here; it did not answer my question.
My questions:
1) I think it is not advisable to store this data in an RDBMS, because then I cannot filter in the SQL queries and would have to do a full table scan. So, in this situation, how should the data be stored?
2) The second question is the same as this. But, just for the completeness of my question: how do I search through the large data set?
Suppose there is a name Franky in the dataset.
If a user types Phranky, how do I match it to Franky? Do I have to loop through all the names?
I came across Levenshtein distance, which seems like a good technique for finding the possible strings. But again, do I have to run it against all 1 million values in my data store?
3) I know Google does it by watching user behavior, but I want to do it without watching user behavior, i.e. by using something like distance algorithms, because the behavior-based method requires a large volume of searches to start with!
4) As Kirk Broadhurst pointed out in an answer below, there are two possible scenarios:
Users mistyping a word (an edit distance algorithm)
Users not knowing a word and guessing (a phonetic match algorithm)
I am interested in both of these. They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 3 - too high to be considered a typo.
The Soundex algorithm may help you out with this.
http://en.wikipedia.org/wiki/Soundex
You could pre-generate the soundex values for each name and store it in the database, then index that to avoid having to scan the table.
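A rough sketch of that pre-computation, using the third-party jellyfish package for the Soundex codes (that choice is an assumption; any Soundex implementation would do). Grouping names by code here is the same effect an index on a soundex column in the database would give you.

```python
# Sketch: precompute a Soundex code per name and group names by code.
# Uses the third-party `jellyfish` package.
from collections import defaultdict
import jellyfish

names = ["Franky", "Frank", "Shawn", "Sean"]

index = defaultdict(list)
for name in names:
    index[jellyfish.soundex(name)].append(name)

query = "Shaun"
print(index.get(jellyfish.soundex(query), []))  # candidates sharing the code
# Note: Soundex keeps the first letter, so "Phranky" would not land in the same
# bucket as "Franky"; Metaphone (see another answer here) handles that case.
```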
The Bitap algorithm is designed to find an approximate match in a body of text. Maybe you could use it to calculate probable matches (it's based on the Levenshtein distance).
(Update: after reading Ben S's answer, I think using an existing solution, possibly Aspell, is the way to go.)
As others said, Google does auto correction by watching users correct themselves. If I search for "someting" (sic) and then immediately for "something" it is very likely that the first query was incorrect. A possible heuristic to detect this would be:
If a user has done two searches in a short time window,
and the first query did not yield any results (or the user did not click on anything),
and the second query did yield useful results,
and the two queries are similar (have a small Levenshtein distance),
then the second query is a possible refinement of the first query, which you can store and present to other users.
Note that you probably need a lot of queries to gather enough data for these suggestions to be useful.
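A sketch of that heuristic over consecutive query pairs; the log format (timestamp, query, clicked_result) is made up, and difflib's similarity ratio stands in for a proper Levenshtein threshold.

```python
# Sketch of the refinement heuristic above. The log format is an assumption.
import difflib

def is_refinement(first, second, max_gap_seconds=60):
    close_in_time = (second["timestamp"] - first["timestamp"]) <= max_gap_seconds
    first_failed = not first["clicked_result"]
    second_worked = second["clicked_result"]
    similar = difflib.SequenceMatcher(None, first["query"], second["query"]).ratio() > 0.8
    return close_in_time and first_failed and second_worked and similar

log = [
    {"timestamp": 0,  "query": "someting",  "clicked_result": False},
    {"timestamp": 12, "query": "something", "clicked_result": True},
]
if is_refinement(log[0], log[1]):
    print(f'store suggestion: "{log[0]["query"]}" -> "{log[1]["query"]}"')
```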
I would consider using a pre-existing solution for this.
Aspell with a custom dictionary of the names might be well suited for this. Generating the dictionary file will pre-compute all the information required to quickly give suggestions.
This is an old problem, DWIM (Do What I Mean), famously implemented on the Xerox Alto by Warren Teitelman. If your problem is based on pronunciation, here is a survey paper that might help:
J. Zobel and P. Dart, "Phonetic String Matching: Lessons from Information Retrieval," Proc. 19th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR'96), Aug. 1996, pp. 166-172.
I'm told by my friends who work in information retrieval that Soundex as described by Knuth is now considered very outdated.
Just use Solr or a similar search server, and then you won't have to be an expert in the subject. With the list of spelling suggestions, run a search with each suggested result, and if there are more results than the current search query, add that as a "did you mean" result. (This prevents bogus spelling suggestions that don't actually return more relevant hits.) This way, you don't require a lot of data to be collected to make an initial "did you mean" offering, though Solr has mechanisms by which you can hand-tune the results of certain queries.
Generally, you wouldn't be using an RDBMS for this type of searching, instead depending on read-only, slightly stale databases intended for this purpose. (Solr adds a friendly programming interface and configuration to an underlying Lucene engine and database.) On the Web site of the company I work for, a nightly service selects altered records from the RDBMS and pushes them as documents into Solr. With very little effort, we have a system where the search box can search products, customer reviews, Web site pages, and blog entries very efficiently and offer spelling suggestions in the search results, as well as faceted browsing such as you see at NewEgg, Netflix, or Home Depot, with very little added strain on the server (particularly the RDBMS). (I believe both Zappo's [the new site] and Netflix use Solr internally, but don't quote me on that.)
In your scenario, you'd be populating the Solr index with the list of names and selecting an appropriate matching algorithm in the configuration file.
Just as in one of the answers to the question you reference, Peter Norvig's great solution would work for this, complete with Python code. Google probably does query suggestion a number of ways, but the thing they have going for them is lots of data. Sure they can go model user behavior with huge query logs, but they can also just use text data to find the most likely correct spelling for a word by looking at which correction is more common. The word someting does not appear in a dictionary and even though it is a common misspelling, the correct spelling is far more common. When you find similar words you want the word that is both the closest to the misspelling and the most probable in the given context.
Norvig's solution is to take a corpus of several books from Project Gutenberg and count the words that occur. From those words he creates a dictionary where you can also estimate the probability of a word (COUNT(word) / COUNT(all words)). If you store this all as a straight hash, access is fast, but storage might become a problem, so you can also use things like suffix tries. The access time is still the same (if you implement it based on a hash), but storage requirements can be much less.
Next, he generates simple edits for the misspelt word (by deleting, adding, or substituting a letter) and then constrains the list of possibilities using the dictionary from the corpus. This is based on the idea of edit distance (such as Levenshtein distance), with the simple heuristic that most spelling errors take place with an edit distance of 2 or less. You can widen this as your needs and computational power dictate.
Once he has the possible words, he finds the most probable word from the corpus, and that is your suggestion. There are many things you can add to improve the model. For example, you can also adjust the probability by considering the keyboard distance of the letters in the misspelling. Of course, that assumes the user is typing in English on a QWERTY keyboard: for example, hitting a q instead of an e is more likely than hitting an l instead of an e, because q is much closer to e on the keyboard.
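A condensed version of that approach, closely following Norvig's published corrector but simplified; `corpus.txt` is a placeholder for your own text corpus.

```python
# Condensed version of the approach described above, heavily based on Norvig's
# well-known spelling corrector. `corpus.txt` is a placeholder file.
import re
from collections import Counter

WORDS = Counter(re.findall(r"[a-z]+", open("corpus.txt").read().lower()))
TOTAL = sum(WORDS.values())

def P(word):
    return WORDS[word] / TOTAL

def edits1(word):
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    return {w for w in words if w in WORDS}

def correct(word):
    # Prefer exact dictionary hits, then edit distance 1, then 2, ranked by frequency.
    candidates = (known([word]) or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1)) or [word])
    return max(candidates, key=P)

print(correct("someting"))  # likely "something" with a large enough corpus
```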
For people who are recommending Soundex, it is very out of date. Metaphone (simpler) or Double Metaphone (complex) are much better. If it really is name data, it should work fine, if the names are European-ish in origin, or at least phonetic.
As for the search, if you care to roll your own rather than use Aspell or some other smart data structure... pre-calculating possible matches is O(n^2) in the naive case, but we know that to match at all, two names must share at least one "phoneme" overlap, maybe even two. This pre-indexing step (which has a low false-positive rate) can bring the complexity down a lot (in the practical case, to something like O(30^2 * k^2), where k << n).
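A sketch of that phonetic pre-indexing idea, again using the third-party jellyfish package (an assumption; any Metaphone implementation works); the expensive edit-distance comparison only runs inside the matching bucket.

```python
# Sketch of phonetic pre-indexing: bucket names by Metaphone key, then run the
# expensive edit-distance comparison only inside the matching bucket.
from collections import defaultdict
import jellyfish

names = ["Franky", "Frank", "Francine", "Shawn", "Sean"]
buckets = defaultdict(list)
for name in names:
    buckets[jellyfish.metaphone(name)].append(name)

def suggest(query, max_distance=2):
    candidates = buckets.get(jellyfish.metaphone(query), [])
    return [n for n in candidates
            if jellyfish.levenshtein_distance(query.lower(), n.lower()) <= max_distance]

print(suggest("Phranky"))  # Metaphone maps "Ph" to "F", so "Franky" is found
```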
You have two possible issues that you need to address (or not address, if you so choose):
Users mistyping a word (an edit distance algorithm)
Users not knowing a word and guessing (a phonetic match algorithm)
Are you interested in both of these, or just one or the other? They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 3 - too high to be considered a typo.
You should pre-index the count of words to ensure you are only suggesting relevant answers (similar to ealdent's suggestion). For example, if I entered sith I might expect to be asked if I meant smith, but if I typed smith it would not make sense to suggest sith. Determine an algorithm that measures the relative likelihood of a word, and only suggest words that are more likely.
My experience in loose matching taught me a simple but important lesson: perform as many indexing/sieve layers as you need, and don't be scared of including more than 2 or 3. Cull out anything that doesn't start with the correct letter, for instance, then cull everything that doesn't end in the correct letter, and so on. You really only want to perform the edit distance calculation on the smallest possible dataset, as it is a very intensive operation.
So if you have an O(n), an O(n log n), and an O(n^2) algorithm, perform all three, in that order, to ensure you are only putting your 'good prospects' through to your heavy algorithm.
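A sketch of that sieve idea: cheap filters first, the expensive edit-distance pass only on the survivors. The specific filters, names, and thresholds here are illustrative, not a recommendation.

```python
# Sketch of the sieve idea: apply cheap O(n) filters first and save the
# expensive edit-distance pass for the few survivors.
def cheap_filters(query, names):
    survivors = [n for n in names if n[0].lower() == query[0].lower()]   # layer 1: first letter
    survivors = [n for n in survivors if abs(len(n) - len(query)) <= 2]  # layer 2: similar length
    return survivors

def edit_distance(a, b):
    # Plain dynamic-programming Levenshtein, only run on the small survivor set.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

names = ["Franky", "Frank", "Francine", "Frankie", "Shawn"]
query = "Frankey"
survivors = cheap_filters(query, names)
print(sorted(survivors, key=lambda n: edit_distance(query.lower(), n.lower()))[:3])
```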