Removing outlier documents from corpus - LDA

How can I remove outlier documents from a corpus before passing it to LDA?
I am doing topic modeling using LDA. I have a large source of data from different websites. I want to classify the documents into 5 categories, but the presence of outlier documents gives inaccurate results.
Can anyone please help with this issue? I want only articles related to one of the 5 categories to be present after classification.

You need to take a subset of your current dataset as input to your model. Are there particular characteristics of the articles that are outliers? If, for example, some articles are much too long, you could subset by:
corpus = corpus[corpus['text'].str.len() < 1000]
Or, if you identify some outliers manually, you could delete them by:
corpus = corpus[corpus['title'] != 'Stackoverflow saved my life']
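If the hard-coded 1000-character cutoff above feels arbitrary, a softer variant (a sketch, assuming corpus is a pandas DataFrame with a 'text' column) is to cut at a length quantile instead:

length = corpus['text'].str.len()
corpus = corpus[length < length.quantile(0.99)]  # drop the longest 1% of articles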

Easy way: throw out words that are so frequent that they tell us little about the topics, as well as words that are too infrequent (appearing in fewer than 15 documents), and then keep only the 100,000 most frequent of the remaining words:
dictionary_15.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
Hard way: if you only want documents within a certain topic, you can build a two-layer LDA: first allocate topics generally, then build a second LDA on the documents that the first layer classified into your target topic. I would build an LDA with, say, five topics, output the document-topic assignments to a CSV, create a new document set by sorting and filtering (Alteryx or even Excel might be easier than Python), and use that document set for the second step.
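A rough sketch of that two-layer idea in gensim, staying entirely in Python instead of exporting to a CSV. Here load_tokenized_articles is a hypothetical placeholder for however you load your token lists, and the target topic id is made up:

from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = load_tokenized_articles()  # hypothetical loader: one token list per article
dictionary = Dictionary(docs)
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
bows = [dictionary.doc2bow(d) for d in docs]

# First layer: broad allocation into 5 topics
lda1 = LdaModel(bows, num_topics=5, id2word=dictionary, passes=10)

def dominant_topic(lda, bow):
    topics = lda.get_document_topics(bow)
    return max(topics, key=lambda t: t[1])[0] if topics else None

# Keep only the documents whose dominant topic is the one you care about
target = 2  # made-up topic id of interest
kept = [d for d, b in zip(docs, bows) if dominant_topic(lda1, b) == target]

# Second layer: re-run LDA on the filtered subset only
dict2 = Dictionary(kept)
bows2 = [dict2.doc2bow(d) for d in kept]
lda2 = LdaModel(bows2, num_topics=5, id2word=dict2, passes=10)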

Related

How to reveal relations between the number of words and the target with self-attention-based models?

Transformers can handle variable-length input, but what if the number of words correlates with the target? Let's say we want to perform sentiment analysis on reviews where longer reviews are more likely to be bad. How can the model harness this knowledge? Of course, a simple solution could be to add this count as a feature after the self-attention layers. However, this hand-crafted approach wouldn't reveal more complex relations; for example, a high count of word X might correlate with target 1, except when there is also a high count of word Y, in which case the target tends to be 0.
How could this information be included using deep learning? Paper recommendations on the topic are also appreciated.
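As a point of reference for the simple baseline the question mentions (feeding the count in after the self-attention layers), a minimal PyTorch sketch could look like this; all class and parameter names are illustrative, and it is not an answer to the harder question about complex interactions:

import torch
import torch.nn as nn

class LengthAwareClassifier(nn.Module):
    def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=2, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model + 1, n_classes)  # +1 for the length feature

    def forward(self, token_ids, pad_mask):
        # pad_mask: bool tensor, True where the token is padding
        x = self.encoder(self.embed(token_ids), src_key_padding_mask=pad_mask)
        real = (~pad_mask).unsqueeze(-1).float()
        lengths = real.sum(dim=1)                      # number of real tokens per review
        pooled = (x * real).sum(dim=1) / lengths       # mean over non-padding positions
        features = torch.cat([pooled, lengths.log()], dim=1)
        return self.head(features)

The classifier head can at least learn interactions between the pooled content representation and the (log) length; richer interactions, such as the X/Y example above, would need the length signal to enter the model earlier than this.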

Creating more relevant results from LDA topic modeling?

I am doing a project for my degree and I have an actual client from another college. They want me to do all this topic modeling work on an SQL file of paper abstracts he's given me. I have zero experience with topic modeling, but I've been using Gensim and NLTK in a Jupyter notebook for this.
What he wants right now is for me to generate 10 or more topics, record the 10 most common words overall from the LDA's results, and then, if those words are very frequent in every topic, remove them from the resulting word cloud; if they are more variable across topics, remove them only from the topics where they are infrequent and keep them in the more relevant topics.
He also wants me to compare the frequency of each topic against the SQL files from other years. And he wants these topics to have names generated smartly by the computer.
I have topic models per year and overall, but of course they do not come out exactly the same in each year. My biggest concern is the first request, the removal process. Is any of this possible? I need help figuring out where to look, as Google is not giving me what I want; I am probably searching for it wrong.
Thank you!
Show some of the code you use so we can give you more useful tips. Also use the nlp tag; the tags you used are rather specific and not followed by many people, so your question might be hard for the relevant users to find.
By the whole word-removal thing, do you mean stop words too? Or did you already remove those? Stop words are very common words ("the", "it", "me", etc.) which often appear high in most-frequent-word lists but do not really carry any meaning for finding topics.
First, you remove the stop words to make the most-common-words list more useful.
Then, as he requested, you look at which (more common) words are common in ALL the topics (in the case of abstracts I imagine this is stuff like hypothesis, research, paper, results, etc., i.e. words that are abstract-specific but not useful for distinguishing topics between different abstracts) and remove those. For this kind of analysis, as well as for the initial LDA, it probably makes sense to use all the data from all years so the model has a large amount of data to recognize patterns in. But you should try the variations and see whether the per-year or the overall version gives you nicer results.
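A hedged sketch of that "common in ALL topics" step, assuming lda is an already-trained gensim LdaModel and that "common" is approximated as appearing among each topic's top words:

top_n = 10  # how many top words per topic to consider
topic_top_words = [
    {word for word, _ in lda.show_topic(topic_id, topn=top_n)}
    for topic_id in range(lda.num_topics)
]
shared = set.intersection(*topic_top_words)              # words in the top list of every topic
filtered_topics = [words - shared for words in topic_top_words]  # use these for the word clouds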
After you have your global word lists per topic, you go back to the original data (split up by year) and count how often the combined words from a topic occur per year. If you view this over the years, you can probably see trends, for example topics that are popular now or in the last few years but weren't relevant if you go back far enough.
The last thing you mentioned (automatically assigning labels to topics) is actually something quite tricky, depending on how you go about it.
The "easy" way would be e.g. just use the most frequent word in each topic as label but the results will probably be underwhelming.
A more advanced approach is Topic Labeling. Or you can try an approach like modified text summarization using more powerful models.

Input documents to LDA

Assume I have N text documents and I run LDA in the following 2 ways:
1. run LDA over all N documents at once
2. run LDA on each document separately, so for N documents you run the algorithm N times
I am also unsure about the number of topics to choose: in the first case I can select N as the number of topics (assuming each document is about a single topic), but if I run it on each document separately I am not sure how to select the number of topics either.
What's going on in these two cases?
Latent Dirichlet Allocation is intended to model the topic and word distributions for each document in a corpus of documents.
Running LDA over all of the documents in the corpus at once is the normal approach; running it on a per-document basis is not something I've heard of. I wouldn't recommend doing this. It's difficult to say what would happen, but I wouldn't expect the results to be nearly as useful, because you couldn't meaningfully compare one document/topic or topic/word distribution with another.
I'm thinking that your choice of N for the number of topics might be too high (what if you had thousands of documents in your corpus?), but it really depends on the nature of the corpus you are modelling. Remember that LDA assumes a document will be a distribution over topics, so it might be worth rethinking the assumption that each document is about one topic.
LDA is a statistical model that predicts or assigns topics to documents. It works by distributing the words of each document over topics (randomly the first time), then repeats this step for a number of iterations (possibly 500) until the assignment of words to topics is almost stable; it can then assign topics to a document according to the most frequent words in the document that have a high probability under a topic.
So it does not make sense to run it over a single document: the words assigned to the topics in the first iteration will barely change over the iterations, because you are using only one document, and the topics assigned to that document will be meaningless.
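A tiny gensim illustration of the first (normal) setup, using a toy corpus: one model trained on all the documents gives every document a distribution over the same shared topics, which is what makes the documents comparable at all:

from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["cats", "purr", "cats"], ["dogs", "bark"], ["cats", "dogs", "pets"]]  # toy corpus
dictionary = Dictionary(docs)
bows = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(bows, num_topics=2, id2word=dictionary, passes=50)

for i, bow in enumerate(bows):
    print(i, lda.get_document_topics(bow))  # e.g. [(0, 0.93), (1, 0.07)]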

Efficient way to find 200,000 product names in 20 million articles?

We have two (MySQL) databases: one with about 200,000 products (like "Samsung Galaxy S4", db size 200 MB) and one with about 10 million articles (plain text, db size 20 GB), which can contain zero, one, or many of the product names from the product database. Now we want to find the product names in the article texts and store them as facets of the articles while indexing them in Elasticsearch. Using regular expressions to find the products is pretty slow. We looked at Apache OpenNLP and the Stanford Named Entity Recognizer; for both we would have to train our own models, and there are some projects on GitHub for integrating those NER tools into Elasticsearch, but they don't seem to be ready for production use.
Products and articles are added every day, so we have to run a complete recognition every day. Is NER the way to go? Or any other ideas? We don't have to understand the grammar etc. of the text, we only have to find the product name strings as fast as possible. We can't do the calculation in real time because that's way too slow, so we have to pre-calculate the connections between articles and products and store them as facets, so we can query them pretty quickly in our application.
So what's your recommendation to find so many product names in so many articles?
One of the issues you'll run into the most is consistency... new articles and new product names are always coming in and you'll have an "eventual consistency" problem. So there are three approaches that come to mind that I have used to tackle this kind of problem.
As suggested, use full-text search in MySQL: basically, create a loop over your products table and, for each product name, do a MATCH AGAINST query and insert the product key and article key into a tie table. This is fast; I used to run a system in SQL Server with over 90,000 items being searched against 1B sentences. If you had a multithreaded Java program that chunked up the categories and executed the full-text query, you may be surprised how fast this is. Also, this can hammer your DB server.
Use regex. Put all the products in a collection in memory, and do a regex find with that list against every document (a rough sketch of this appears below). This CAN be fast if you have your docs in something like Hadoop, where it can be parallelized. You could run the job at night and have it populate a MySQL table... This approach means you will have to start storing your docs in HDFS or some NoSQL solution, or import from MySQL to Hadoop daily, etc.
You can try doing it "at index time", so when a record is indexed in Elasticsearch the extraction happens then and your facets are built. I have only used Solr for stuff like this... The problem here is that when you add new products you will have to process everything in batch again anyway, because the previously indexed docs will not have had the new products extracted from them.
So there may be better options, but the one that scales infinitely (if you can afford the machines) is option 2, the Hadoop job... but this means a big change.
These are just my thoughts, so I hope others come up with more clever ideas
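To make option 2 more concrete, here is a rough single-machine sketch of the in-memory regex scan (the product names are made up; with 200,000 names you would parallelize the scan as described above). Sorting the names longest-first keeps longer products from being shadowed by their prefixes:

import re

product_names = ["Samsung Galaxy S4", "Samsung Galaxy S4 Mini", "iPhone 5"]  # ~200,000 in practice
pattern = re.compile(
    "|".join(re.escape(p) for p in sorted(product_names, key=len, reverse=True)),
    flags=re.IGNORECASE,
)

def products_in(article_text):
    # Distinct product mentions found in one article
    return set(pattern.findall(article_text))

print(products_in("Hands-on with the Samsung Galaxy S4 Mini and the iPhone 5."))

In practice a dedicated multi-pattern matcher (e.g. Aho-Corasick, or the token trie described in the next answer) will usually scale better than one huge alternation regex.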
EDIT:
As for using NER, I have used NER extensively, mainly OpenNLP, and the problem with this is that what it extracts will not be normalized, or to put it another way, it may extract pieces and parts of a product name, and you will be left dealing with things like fuzzy string matching to align the NER Results to the table of products. OpenNLP 1.6 trunk has a component called the EntityLinker, which is designed for this type of thing (linking NER results to authoritative databases). Also, NER/NLP will not solve the consistency problem, because every time you change your NER model, you will have to reprocess.
I'd suggest a preprocessing step: tokenization. If you do so for the product list and for the incoming articles, then you won't need a per-product search: the product list becomes an automaton where each transition is a given token.
That gives us a trie that you'll use to match products against texts; the search will look like:
# text: the tokenized article (same tokenizer as the product list)
products = []
availableNodes = [dictionary.root]
for token in text:
    # record products whose last token was consumed on the previous step
    for node in availableNodes:
        if node.productName:
            products.append(node.productName)
    # advance every live path by the current token (and restart from the root)
    nextAvailableNodes = [dictionary.root]
    for node in availableNodes:
        childNode = node.getChildren(token)
        if childNode:
            nextAvailableNodes.append(childNode)
    availableNodes = nextAvailableNodes
# record products ending on the very last token
for node in availableNodes:
    if node.productName:
        products.append(node.productName)
As far as I can tell, this algorithm is quite efficient, and it allows you to fine-tune the node.getChildren() function (e.g. to address capitalization or diacritics issues). Loading the product list as a trie may take some time; in that case you could cache it as a binary file.
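For completeness, one possible way to build the trie the code above assumes, using the same root / getChildren / productName names (the tokenizer here is a naive lowercase split, and the product names are made up):

class Node:
    def __init__(self):
        self.children = {}
        self.productName = None  # set on the node where a product name ends

    def getChildren(self, token):
        return self.children.get(token)

class Trie:
    def __init__(self):
        self.root = Node()

    def add(self, productName, tokenize):
        node = self.root
        for token in tokenize(productName):
            node = node.children.setdefault(token, Node())
        node.productName = productName

tokenize = lambda s: s.lower().split()
dictionary = Trie()
for name in ["Samsung Galaxy S4", "Samsung Galaxy Tab"]:
    dictionary.add(name, tokenize)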
This simple method can easily be distributed using Hadoop or another MapReduce approach, either over the texts or over the product list; see for instance this article (but you'll probably need more recent / accurate ones).

How to automatically classify words in the dictionary?

I have a large dictionary file, dic.txt (it's actually SOWPODS), with one English word per line. I want to automatically split this file into 3 different files: easy_dic.txt (the most common everyday words we use - the vocabulary of a 16-year-old), medium_dic.txt (words not in common usage but still known to many people - the knowledge of a 30-year-old, minus the words found in easy_dic.txt), and hard_dic.txt (very esoteric words that only professional Scrabble players would know). What's the easiest way (you can use any resources from the internet) to accomplish this?
Google has the right tool :), and shares its DB!
The Ngram viewer is a tool to check out and compare the frequency of appearance of words in literature, magazines, etc.
You can download the DB, and train your dictionaries from here.
HTH!
BTW, the tool is VERY fun to use for discovering words' birth and disappearance dates.
Take some books (preferably from your three categories) that are available in a computer-readable form.
Create histograms for all words from those books.
Merge the histograms for all books from each category.
When processing your dictionary, check in which category's histogram the word has the highest count and put the word in that category.
Instead of the last step, you could also simply process your histograms and remove each word from all histograms except the one with the highest number of hits. Then you already have word lists without using an external dictionary file.
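A hedged sketch of that histogram approach; the three corpus file names are made up, and words never seen in any corpus default to the hard list:

from collections import Counter
import re

def word_histogram(path):
    with open(path, encoding="utf-8") as f:
        return Counter(re.findall(r"[a-z]+", f.read().lower()))

histograms = {
    "easy": word_histogram("easy_books.txt"),
    "medium": word_histogram("medium_books.txt"),
    "hard": word_histogram("hard_books.txt"),
}

with open("dic.txt", encoding="utf-8") as f:
    words = [line.strip().lower() for line in f if line.strip()]

categories = {"easy": [], "medium": [], "hard": []}
for word in words:
    # put the word in the category whose books use it most often
    best = max(histograms, key=lambda c: histograms[c][word])
    categories[best if histograms[best][word] > 0 else "hard"].append(word)

for name, bucket in categories.items():
    with open(f"{name}_dic.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(bucket))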
Download a Wikipedia dump and learn word frequencies with a LingPipe tool (it has suitable data structures). Check the frequency distribution of the words from your dictionary, then split them into 3 groups.