Is it possible to train the sentiment classification model with the labeled data and then use it to predict sentiment on data that is not labeled? - nltk

I want to do sentiment analysis using machine learning (text classification) approach. For example nltk Naive Bayes Classifier.
But the issue is that a small amount of my data is labeled. (For example, 100 articles are labeled positive or negative) and 500 articles are not labeled.
I was thinking that I train the classifier with labeled data and then try to predict sentiments of unlabeled data.
Is it possible?
I am a beginner in machine learning and don't know much about it.
I am using python 3.7.
Thank you in advance.

Is it possible to train the sentiment classification model with the labeled data and then use it to predict sentiment on data that is not labeled?
Yes. This is basically the definition of what supervised learning is.
I.e. you train on data that has labels, so that you can then put it into production on categorizing your data that does not have labels.
(Any book on supervised learning will have code examples.)
I wonder if your question might really be: can I use supervised learning to make a model, assign labels to another 500 articles, then do further machine learning on all 600 articles? Well the answer is still yes, but the quality will fall somewhere between these two extremes:
Assign random labels to the 500. Bad results.
Get a domain expert assign correct labels to those 500. Good results.
Your model could fall anywhere between those two extremes. It is useful to know where it is, so know if it is worth using the data. You can get an estimate of that by taking a sample, say 25 records, and have them also assigned by a domain expert. If all 25 match, there is a reasonable chance your other 475 records also have been given good labels. If e.g. only 10 of the 25 match, the model is much closer to the random end of the spectrum, and using the other 475 records is probably a bad idea.
("10", "25", etc. are arbitrary examples; choose based on the number of different labels, and your desired confidence in the results.)

Related

Analyse data with degree of affection

Hello everyone! I'm a newbie studying Data Analysis.
If you'd like to see relationship how A,B,C affects outcome, you may use several models such as KNN, SVM, Logistics regression (as far as I know).
But all of them are kinda categorical, rather than degree of affection.
Let's say, I'd like to show how Fonts and Colors contribute the degree of attraction (as shown).
What models can I use?
Thousands thanks!
If your input is only categorical variables (and having a few values each), then there are finitely many potential samples. Therefore, the model will have finitely many inputs, and, therefore, only a few outputs. Just warning.
If you use, say, KNN or random forest, you can assign L2 norm as your accuracy metric. It will emphasize that 1 is closer to 2 than 5 (pls don't forget to normalize).

What Algo to use to classify my data to 3 classes

I'm looking for a way to differentiate between 3 classes(classification problem) for each OBJECT to classify.
I have a large dataset(millions of lines). There are 2 features, each have 100 values(scaled to 0-1).
Each line refers to one sample of a specific Object(Object_id, 100 columns of my first feature, 100 of my second feature).
Each object(that has to be classified to either 3 classes) have at least 100 samples(1 sample is 1 line)
Unfortunately Classe 3 counts only 1/10 compared to 1 and 2(each object of classe 3 have around 500 samples, however classe 1 and 2 objects have around 2000 and more).
In order to do the classification, I need to take a bach of samples for each object(for exmaple 20, 50, or 100).
I dont know what algo suites better for my case, I'm new to deep learning so bear with me please
Let's break this down to two main questions: how to handle unbalanced datasets and which model to use.
Unbalanced datasets
Most machine learning algorithms are sensitive to some degree on unbalanced datasets. This is a huge challenge for Machine Learning in fields like medical diagnostics or seismology, where you have 98% "normal" readings and 2% "event" readings. There is no silver bullet to this problem. Some algorithms are more resilient to an unbalanced dataset, and some that deliberately unbalance their datasets to encourage a strong model (see bagging), and there are options to augment your data by introducing cloned data with statistical noise. However, your easiest and most effective approach is to decimate your dataset to make it balanced.
You have a class split of 2000|2000|500 datapoints. Randomly sample 500 datapoints from each of the first two classes so you have a balanced 500|500|500 dataset. It is important to randomly sample, instead of simply taking the first 500 as you want a representative sample of the class population. see the numpy.random module for how to select your datapoints.
Model selection
Although Deep Learning is portrayed as the be-all and end-all for machine learning, it represents a significant amount of time and cost to prepare, train and monitor. A typical approach to any new problem is to try some "baseline" shallow learning models. Often you'll see the following scenarios:
Your baseline models fail to train.
Your baseline model trains and fits moderately
Your baseline model trains and fits closely
In the first scenario, your deep learning model is unlikely to train either. In the third scenario there is no need to build a deep learning model when a simpler algorithm can solve it. Scenario 2 is your candidate fro deep learning.
So what models could you use?
Well, we know that it's a supervised problem, that we have a good number of samples, and that we are looking to classify. Your best bet for this kind of question is a Random Forests model. There is a good simple implementation in scikit-learn and hundreds of tutorials.
Alternatively, if you're looking at class fit through clustering, K-means ++ models (and co), or even Gaussian Mixture Models are a good place to start (again, see scikit learn's sklearn.clustering and sklearn.mixture)
If it fits well, then your work is done. If it fits moderately, think about deep learning. If it fails to fit, get add more features (and more diverse features) to your dataset.

training small amount of data on the large capacity network

Currently I am using the convolutional neural networks to solve the binary classification problem. The data I use is 2D-images and the number of training data is only about 20,000-30,000. In deep learning, it is generally known that overfitting problems can arise if the model is too complex relative to the amount of the training data. So, to prevent overfitting, the simplified model or transfer learning is used.
Previous developers in the same field did not use high-capacity models (high-capacity means a large number of model parameters) due to the small amount of training data. Most of them used small-capacity models and transfer learning.
But, when I was trying to train the data on high-capacity models (based on ResNet50, InceptionV3, DenseNet101) from scratch, which have about 10 million to 20 million parameters in, I got a high accuracy in the test set.
(Note that the training set and the test set were exclusively separated, and I used early stopping to prevent overfitting)
In the ImageNet image classification task, the training data is about 10 million. So, I also think that the amount of my training data is very small compared to the model capacity.
Here I have two questions.
1) Even though I got high accuracy, is there any reason why I should not use a small amount of data on the high-capacity model?
2) Why does it perform well? Even if there is a (very) large gap between the amount of data and the number of model parameters, the techniques like early stopping overcome the problems?
1) You're completely right that small amounts of training data can be problematic when working with a large model. Given that your ultimate goal is to achieve a "high accuracy" this theoretical limitation shouldn't bother you too much if the practical performance is satisfactory for you. Of course, you might always do better but I don't see a problem with your workflow if the score on the test data is legit and you're happy with it.
2) First of all, I believe ImageNet consists of 1.X million images so that puts you a little closer in terms of data. Here are a few ideas I can think of:
Your problem is easier to solve than ImageNet
You use image augmentation to synthetically increase your image data
Your test data is very similar to the training data
Also, don't forget that 30,000 samples means (30,000 * 224 * 224 * 3 =) 4.5 billion values. That should make it quite hard for a 10 million parameter network to simply memorize your data.
3) Welcome to StackOverflow

Mixing text and numeric features for text classification using deep learning

I have a problem about classification of text into several categories (topics). Apart from text, I have some numeric features that I believe may be useful (there are also missing values among those features). But the most important information is, of course, presented in the text. Therefore, I think deep learning approach (with a common pipeline: embedding layer + CNN or RNN with dropout + Dense layer) would be the best choice. What is the best practice to mix the current model that works only on text input with numeric features? Are there any tricks, best common practices, state-of-the-art research going on in this field? Are there any papers/experiments (on GitHub, maybe) on this topic?
It'd be great if we could think of the problem in general, but for the sake of having an idea of what sort of problem we may solve, I will give a specific example. Let's suppose we have reviews from users in which they describe a problem they faced while receiving a service or purchasing an item. The target feature is multi-label: the set of tags (categories/topics) associated with the complaint that a user had (we should choose relevant ones among a few hundreds of possible topics).
Then apart from the user's comment itself (which is the most important feature), we may want also to take into account some numerical features like price, waiting time, rating (customer satisfaction score), etc. This can potentially be useful for predicting some particular categories.
The idea is to mix all these features somehow in a deep learning model to produce the final model. Not sure if I know much about the best ways how to do it. What are the best practices / useful tricks for this kinds of problems?
For each numeric feature, statistically have a representation (you can use pandas.DataFrame.describe), also plotting the distribution would visually make you stronger.
After having the values of mean, std, max, min etc. You should get rid ofoutliers which can harm your training model. For example, if your features have its 90% of its numeric values from 18 to 72 but has also values like 1.1 or 1200 etc. you should get rid of those by equalizing them to 18 or 72 depending on the side. You can use np.clip()
After having a reasonable distribution, you should convert those numeric features to categorical features. For instance, numeric distribution from 18 to 72 can be grouped as 18, 27, 36, ......, 72, taking the intervals. You can increase the resolution or decrease it, depending on your understanding and the performance of the algorithm. You can use np.digitize() or do manually by a simple function that you can write.
In the end you have a categorical feature just like the texts. CNN or RNN can work fine with categorical representations of the numeric values as well as you get the better advantage to have feature crosses to increase your performance.
But if you ask for something of more complex, I might not have understood your question or I may not know it. Still, if you want to ask more or differently, I will be happy to try to help.

How is unsupervised deep learning used in sentiment analysis?

Typically text classification, including sentiment analysis can be performed in one of 2 ways: 1. Supervised learning if there is enough training data and 2. A unsupervised training when there is no enough training data which is not prelabeled
I have only a collection of tweets which contains only the texte (reviews) and there is no polarity fir each twwet.
My question is is there any method to di sentimeent analysis on this data using unsupervised learning?
Thank you to help me
(Based on your comment, I've concentrated on the "unsupervised" part of your question, and ignored deep learning.)
If you use something like SentiWordNet you can assign a positive or negative score to each word in a tweet, and then (as the simplest approach) sum them to get a single sentiment number for each tweet.
At this point it doesn't really matter if you are doing supervised or unsupervised learning, as either way you will have a score for each tweet, and can divide them up the tweets into, say, positive, neutral and negative sentiment. What the supervised data, the class, does allow is getting an error estimate on how well it has done at classifying them.
If you want an error estimate when your training data has no classes, you could evaluate some percentage of the tweets yourself. Even just doing 30 of them will start to give you an idea of where your grouping algorithm is on the scale from random to perfect, and won't take long.