How is unsupervised deep learning used in sentiment analysis? - deep-learning

Typically, text classification, including sentiment analysis, can be performed in one of two ways: 1. supervised learning, if there is enough labeled training data, and 2. unsupervised learning, when the available data is not prelabeled.
I have only a collection of tweets that contains just the text (reviews), with no polarity for each tweet.
My question is: is there any method to do sentiment analysis on this data using unsupervised learning?
Thank you for your help.

(Based on your comment, I've concentrated on the "unsupervised" part of your question, and ignored deep learning.)
If you use something like SentiWordNet you can assign a positive or negative score to each word in a tweet, and then (as the simplest approach) sum them to get a single sentiment number for each tweet.
At this point it doesn't really matter if you are doing supervised or unsupervised learning, as either way you will have a score for each tweet, and can divide the tweets into, say, positive, neutral and negative sentiment. What supervised data (the class labels) does allow is an error estimate of how well the tweets have been classified.
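As a rough sketch of the sum-the-word-scores idea: the tiny `LEXICON` dictionary below is a made-up stand-in for SentiWordNet (the values are not real SentiWordNet scores), and the function names and the `threshold` cut-off are arbitrary choices for illustration.

```python
# Toy lexicon standing in for SentiWordNet; the scores are invented for this example.
LEXICON = {
    "good": 0.6, "great": 0.8, "love": 0.9,
    "bad": -0.6, "terrible": -0.9, "hate": -0.8,
}

def tweet_score(text, lexicon=LEXICON):
    """Sum the per-word scores; words missing from the lexicon count as 0."""
    words = text.lower().split()
    return sum(lexicon.get(w.strip(".,!?#@"), 0.0) for w in words)

def label(score, threshold=0.25):
    """Bucket a score into positive / neutral / negative (threshold is arbitrary)."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

tweets = [
    "I love this phone, great battery!",
    "Terrible service, I hate waiting.",
    "The package arrived on Tuesday.",
]
labels = [label(tweet_score(t)) for t in tweets]
print(labels)
```

A real lexicon would also need some handling of negation ("not good") and of words with several senses, which this sketch ignores.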
If you want an error estimate when your training data has no classes, you could evaluate some percentage of the tweets yourself. Even just doing 30 of them will start to give you an idea of where your grouping algorithm is on the scale from random to perfect, and won't take long.
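One way to set up that spot check, sketched with the standard library only. The `predictions` list and the faked `manual_labels` are placeholders; in practice you would read each sampled tweet and record your own judgment.

```python
import random

def draw_sample(items, k=30, seed=0):
    """Pick k items uniformly at random, without replacement, for manual labeling."""
    rng = random.Random(seed)
    return rng.sample(items, min(k, len(items)))

def accuracy(predicted, manual):
    """Fraction of sampled tweets where the algorithm agrees with the human label."""
    matches = sum(p == m for p, m in zip(predicted, manual))
    return matches / len(manual)

# Hypothetical data: (tweet_id, predicted_label) pairs from the scoring step.
predictions = [(i, "positive" if i % 2 == 0 else "negative") for i in range(1000)]
sample = draw_sample(predictions, k=30)

# The "manual" labels are faked here so that the sketch runs end to end.
manual_labels = [pred for _, pred in sample]
manual_labels[0] = "neutral"  # pretend the human disagreed once

acc = accuracy([pred for _, pred in sample], manual_labels)
print(f"agreement on {len(sample)} hand-checked tweets: {acc:.2f}")
```

With only 30 tweets the estimate is coarse, but it is enough to tell a near-random grouping from a mostly correct one.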

Related

How to reveal relations between number of words and target with self-attention based models?

Transformers can handle variable-length input, but what if the number of words correlates with the target? Let's say we want to perform sentiment analysis on some reviews where longer reviews are more likely to be bad. How can the model harness this knowledge? Of course, a simple solution could be to add this count as a feature after the self-attention layer. However, this hand-crafted approach wouldn't reveal more complex relations, for example: a high count of word X correlates with target 1, except when there is also a high count of word Y, in which case the target tends to be 0.
How could this information be included using deep learning? Paper recommendations on the topic are also much appreciated.

What Algo to use to classify my data to 3 classes

I'm looking for a way to differentiate between 3 classes (a classification problem) for each OBJECT to classify.
I have a large dataset (millions of lines). There are 2 features, each with 100 values (scaled to 0-1).
Each line refers to one sample of a specific object (Object_id, 100 columns for my first feature, 100 for my second feature).
Each object (which has to be classified into one of the 3 classes) has at least 100 samples (1 sample is 1 line).
Unfortunately, class 3 is only about 1/10 the size of classes 1 and 2 (each class 3 object has around 500 samples, whereas class 1 and 2 objects have around 2000 or more).
In order to do the classification, I need to take a batch of samples for each object (for example 20, 50, or 100).
I don't know which algorithm suits my case best. I'm new to deep learning, so please bear with me.
Let's break this down to two main questions: how to handle unbalanced datasets and which model to use.
Unbalanced datasets
Most machine learning algorithms are sensitive, to some degree, to unbalanced datasets. This is a huge challenge for machine learning in fields like medical diagnostics or seismology, where you have 98% "normal" readings and 2% "event" readings. There is no silver bullet for this problem. Some algorithms are more resilient to an unbalanced dataset, some deliberately unbalance their datasets to encourage a strong model (see bagging), and there are options to augment your data by introducing cloned data with statistical noise. However, your easiest and most effective approach is to downsample (decimate) your dataset to make it balanced.
You have a class split of 2000|2000|500 datapoints. Randomly sample 500 datapoints from each of the first two classes so you have a balanced 500|500|500 dataset. It is important to sample randomly, instead of simply taking the first 500, as you want a representative sample of the class population. See the numpy.random module for how to select your datapoints.
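A minimal sketch of that random downsampling with numpy.random. The helper name `balance_by_downsampling` and the synthetic arrays are assumptions for illustration; only the 2000|2000|500 split comes from the text above.

```python
import numpy as np

def balance_by_downsampling(X, y, seed=0):
    """Randomly keep the same number of rows per class (the smallest class size)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)  # avoid leaving the rows grouped by class
    return X[keep], y[keep]

# Synthetic stand-in for the 2000|2000|500 split described above.
y = np.repeat([1, 2, 3], [2000, 2000, 500])
X = np.random.default_rng(1).random((y.size, 200))  # 2 features x 100 values each
X_bal, y_bal = balance_by_downsampling(X, y)
print(X_bal.shape, np.bincount(y_bal)[1:])
```

Sampling with `replace=False` guarantees that no row is picked twice, and fixing the seed makes the subset reproducible.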
Model selection
Although Deep Learning is portrayed as the be-all and end-all for machine learning, it represents a significant amount of time and cost to prepare, train and monitor. A typical approach to any new problem is to try some "baseline" shallow learning models. Often you'll see the following scenarios:
1. Your baseline model fails to train.
2. Your baseline model trains and fits moderately.
3. Your baseline model trains and fits closely.
In the first scenario, your deep learning model is unlikely to train either. In the third scenario there is no need to build a deep learning model when a simpler algorithm can solve it. Scenario 2 is your candidate for deep learning.
So what models could you use?
Well, we know that it's a supervised problem, that we have a good number of samples, and that we are looking to classify. Your best bet for this kind of question is a Random Forests model. There is a good simple implementation in scikit-learn and hundreds of tutorials.
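For instance, a baseline with scikit-learn's `RandomForestClassifier` might look like the sketch below. The data is synthetic (random numbers with an artificial signal injected into the first 20 columns), and the hyperparameters are untuned defaults, so treat the resulting accuracy as meaningless beyond showing that the pipeline runs.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1500 balanced samples, 200 columns, 3 classes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 500)
X = rng.random((y.size, 200))
X[:, :20] += y[:, None]  # inject a signal so the toy problem is learnable

# Hold out a stratified test set so every class is represented in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

On real data, note that samples from the same object are correlated, so it may be safer to split by object ID rather than by row.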
Alternatively, if you're looking at class fit through clustering, k-means++ models (and co), or even Gaussian Mixture Models, are a good place to start (again, see scikit-learn's sklearn.cluster and sklearn.mixture).
If it fits well, then your work is done. If it fits moderately, think about deep learning. If it fails to fit, add more features (and more diverse features) to your dataset.

Is it possible to train the sentiment classification model with the labeled data and then use it to predict sentiment on data that is not labeled?

I want to do sentiment analysis using machine learning (text classification) approach. For example nltk Naive Bayes Classifier.
But the issue is that only a small amount of my data is labeled (for example, 100 articles are labeled positive or negative, and 500 articles are not labeled).
I was thinking of training the classifier on the labeled data and then trying to predict the sentiment of the unlabeled data.
Is it possible?
I am a beginner in machine learning and don't know much about it.
I am using python 3.7.
Thank you in advance.
Is it possible to train the sentiment classification model with the labeled data and then use it to predict sentiment on data that is not labeled?
Yes. This is basically the definition of what supervised learning is.
I.e. you train on data that has labels, so that you can then put it into production on categorizing your data that does not have labels.
(Any book on supervised learning will have code examples.)
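A minimal sketch of that train-then-predict workflow. It swaps in scikit-learn's `MultinomialNB` on word counts instead of the NLTK Naive Bayes classifier mentioned in the question, and the texts are invented placeholders for the 100 labeled and 500 unlabeled articles.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus; in the question this would be the 100 labeled articles.
labeled_texts = [
    "great product works well",
    "really happy with the excellent service",
    "awful experience very disappointed",
    "poor quality broke quickly",
]
labels = ["positive", "positive", "negative", "negative"]

# Stand-in for the 500 unlabeled articles.
unlabeled_texts = [
    "really happy excellent product",
    "very disappointed poor experience",
]

# Bag-of-words counts feeding a multinomial Naive Bayes classifier.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(labeled_texts, labels)
predicted = clf.predict(unlabeled_texts)
print(list(predicted))
```

With only 100 labeled articles, it is worth holding some of them back to measure accuracy before trusting predictions on the other 500.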
I wonder if your question might really be: can I use supervised learning to make a model, assign labels to another 500 articles, then do further machine learning on all 600 articles? Well the answer is still yes, but the quality will fall somewhere between these two extremes:
Assign random labels to the 500. Bad results.
Get a domain expert to assign correct labels to those 500. Good results.
Your model could fall anywhere between those two extremes. It is useful to know where it is, so you know whether it is worth using the data. You can get an estimate of that by taking a sample, say 25 records, and having them also labeled by a domain expert. If all 25 match, there is a reasonable chance your other 475 records have also been given good labels. If e.g. only 10 of the 25 match, the model is much closer to the random end of the spectrum, and using the other 475 records is probably a bad idea.
("10", "25", etc. are arbitrary examples; choose based on the number of different labels, and your desired confidence in the results.)
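The two scenarios above can be worked through with a small agreement check. The label lists are made up so that the counts match the 25-of-25 and 10-of-25 examples; `agreement_rate` is a hypothetical helper name.

```python
def agreement_rate(model_labels, expert_labels):
    """Fraction of sampled records where the model and the expert agree."""
    if len(model_labels) != len(expert_labels):
        raise ValueError("both label lists must cover the same records")
    matches = sum(m == e for m, e in zip(model_labels, expert_labels))
    return matches / len(expert_labels)

def flip(label):
    """Return the opposite label in this two-label toy example."""
    return "neg" if label == "pos" else "pos"

# Expert labels for the 25 sampled records (invented).
expert = ["pos"] * 13 + ["neg"] * 12

model_good = list(expert)                                # all 25 match
model_poor = expert[:10] + [flip(e) for e in expert[10:]]  # only the first 10 match

rate_good = agreement_rate(model_good, expert)  # 25 / 25
rate_poor = agreement_rate(model_poor, expert)  # 10 / 25
print(rate_good, rate_poor)
```

With two roughly balanced labels, random guessing would agree about half the time, which is the baseline to compare these rates against.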

Large number of training steps results in poor performance in transfer learning

I have a question. I used transfer learning to retrain GoogLeNet on my image classification problem. I have 80,000 images belonging to 14 categories. I set the number of training steps to 200,000. I believe the code provided by TensorFlow has dropout implemented and trains with random shuffling of the dataset and a cross-validation approach. I do not see any overfitting in the training and classification curves, and I get high cross-validation accuracy and high test accuracy, but when I apply my model to a new dataset I get poor classification results. Does anybody know what is going on? Thanks!

LDA topic modeling - Training and testing

I have read LDA and I understand the mathematics of how the topics are generated when one inputs a collection of documents.
References say that LDA is an algorithm which, given a collection of documents and nothing more (no supervision needed), can uncover the “topics” expressed by documents in that collection. Thus by using LDA algorithm and the Gibbs Sampler (or Variational Bayes), I can input a set of documents and as output I can get the topics. Each topic is a set of terms with assigned probabilities.
What I don't understand is, if the above is true, then why do many topic modeling tutorials talk about separating the dataset into training and test set?
Can anyone explain to me the steps (the basic concept) of how LDA can be used to train a model, which can then be used to analyze another test dataset?
Splitting the data into training and testing sets is a common step in evaluating the performance of a learning algorithm. It's more clear-cut for supervised learning, wherein you train the model on the training set, then see how well its classifications on the test set match the true class labels. For unsupervised learning, such evaluation is a little trickier. In the case of topic modeling, a common measure of performance is perplexity. You train the model (like LDA) on the training set, and then you see how "perplexed" the model is on the testing set. More specifically, you measure how well the word counts of the test documents are represented by the word distributions represented by the topics.
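As an illustration of the train/test perplexity idea, here is a sketch using scikit-learn's `LatentDirichletAllocation` (one possible implementation; the answer does not name a library). The documents are invented and far too few for a meaningful model, so only the mechanics matter.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

train_docs = [
    "the senate passed the budget bill",
    "the president signed the tax bill",
    "the team won the football match",
    "the striker scored in the final match",
]
test_docs = [
    "the senate debated the tax budget",
    "the team scored in the final",
]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # vocabulary comes from training only
X_test = vectorizer.transform(test_docs)        # words unseen in training are dropped

# Fit on the training set, then measure perplexity on the held-out test set.
perplexities = {}
for n_topics in (2, 3):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(X_train)
    perplexities[n_topics] = lda.perplexity(X_test)  # lower is better

print(perplexities)
```

Comparing the held-out perplexity across settings such as the number of topics is the usual way this number is put to use.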
Perplexity is good for relative comparisons between models or parameter settings, but its numeric value doesn't really mean much. I prefer to evaluate topic models using the following, somewhat manual, evaluation process:
Inspect the topics: Look at the highest-likelihood words in each topic. Do they sound like they form a cohesive "topic" or just some random group of words?
Inspect the topic assignments: Hold out a few random documents from training and see what topics LDA assigns to them. Manually inspect the documents and the top words in the assigned topics. Does it look like the topics really describe what the documents are actually talking about?
I realize that this process isn't as nice and quantitative as one might like, but to be honest, the applications of topic models are rarely quantitative either. I suggest evaluating your topic model according to the problem you're applying it to.
Good luck!
The general rule that evaluating on the training data is subject to overfitting also applies to unsupervised learning like LDA, though it is not as obvious. LDA optimizes some objective, i.e. generative probability, on the training data. It might be that in the training data two words are indicative of a topic, say "white house" for US politics. Assume the two words occur together only once (in the training data). Then any algorithm fully relying on the assumption that they indicate only politics and nothing else would do great if you evaluated on the training data. However, if there are other topics like "architecture", then you might question whether this is really the right thing to learn. Having a test dataset can solve this issue to some extent, in one of two ways:
Since the relationship "white house" seems rare in the training data, it likely does not occur at all in the test data. If so, the evaluation shows how much your model relies on spurious relationships that might in fact not be helpful compared to more general ones.
"White house" occurs in the test data, say it occurs once for "US politics" and once in a document on architecture. Then the assumption that it only indicates "US politics" is too strong and performance metrics will be worse, showing that your model is overfitting.