What Algo to use to classify my data to 3 classes - deep-learning

I'm looking for a way to differentiate between 3 classes(classification problem) for each OBJECT to classify.
I have a large dataset(millions of lines). There are 2 features, each have 100 values(scaled to 0-1).
Each line refers to one sample of a specific Object(Object_id, 100 columns of my first feature, 100 of my second feature).
Each object(that has to be classified to either 3 classes) have at least 100 samples(1 sample is 1 line)
Unfortunately Classe 3 counts only 1/10 compared to 1 and 2(each object of classe 3 have around 500 samples, however classe 1 and 2 objects have around 2000 and more).
In order to do the classification, I need to take a bach of samples for each object(for exmaple 20, 50, or 100).
I dont know what algo suites better for my case, I'm new to deep learning so bear with me please

Let's break this down to two main questions: how to handle unbalanced datasets and which model to use.
Unbalanced datasets
Most machine learning algorithms are sensitive to some degree on unbalanced datasets. This is a huge challenge for Machine Learning in fields like medical diagnostics or seismology, where you have 98% "normal" readings and 2% "event" readings. There is no silver bullet to this problem. Some algorithms are more resilient to an unbalanced dataset, and some that deliberately unbalance their datasets to encourage a strong model (see bagging), and there are options to augment your data by introducing cloned data with statistical noise. However, your easiest and most effective approach is to decimate your dataset to make it balanced.
You have a class split of 2000|2000|500 datapoints. Randomly sample 500 datapoints from each of the first two classes so you have a balanced 500|500|500 dataset. It is important to randomly sample, instead of simply taking the first 500 as you want a representative sample of the class population. see the numpy.random module for how to select your datapoints.
Model selection
Although Deep Learning is portrayed as the be-all and end-all for machine learning, it represents a significant amount of time and cost to prepare, train and monitor. A typical approach to any new problem is to try some "baseline" shallow learning models. Often you'll see the following scenarios:
Your baseline models fail to train.
Your baseline model trains and fits moderately
Your baseline model trains and fits closely
In the first scenario, your deep learning model is unlikely to train either. In the third scenario there is no need to build a deep learning model when a simpler algorithm can solve it. Scenario 2 is your candidate fro deep learning.
So what models could you use?
Well, we know that it's a supervised problem, that we have a good number of samples, and that we are looking to classify. Your best bet for this kind of question is a Random Forests model. There is a good simple implementation in scikit-learn and hundreds of tutorials.
Alternatively, if you're looking at class fit through clustering, K-means ++ models (and co), or even Gaussian Mixture Models are a good place to start (again, see scikit learn's sklearn.clustering and sklearn.mixture)
If it fits well, then your work is done. If it fits moderately, think about deep learning. If it fails to fit, get add more features (and more diverse features) to your dataset.

Related

Making smaller models from pre-existing redundant models

Sorry for the vague title.
I will just start with an example. Say that I have a pre-existing model that can classify dogs, cats and humans. However, all I need is a model that can classify between dogs and cats (Humans are not needed). The pre-existing model is heavy and redundant, so I want to make a smaller, faster model that can just do the job needed.
What approaches exist?
I thought of utilizing knowledge distillation (using the previous model as a teacher and the new model as the student) and training a whole new model.
First, prune the teacher model to have a smaller version to be used as a student in distillation. A simple regime such as magnitude-based pruning will suffice.
For distillation, as your output vectors will not match anymore (the student is 2 dimensional and teacher is 3 dimensional, you will have to take this into account and only calculate this distillation loss based on the overlapping dimensions. An alternative is layer-wise distillation in which the output vectors are irrelevant and the distillation loss is calculated based on the difference between intermediate layers of the teacher and student. In both cases, total loss may include a difference between student output and label, in addition to student output and teacher output.
It is possible for a simple task like this that just basic transfer learning would suffice after pruning - that is just to replace the 3d output vector with a 2d output vector and continue training.

Analyse data with degree of affection

Hello everyone! I'm a newbie studying Data Analysis.
If you'd like to see relationship how A,B,C affects outcome, you may use several models such as KNN, SVM, Logistics regression (as far as I know).
But all of them are kinda categorical, rather than degree of affection.
Let's say, I'd like to show how Fonts and Colors contribute the degree of attraction (as shown).
What models can I use?
Thousands thanks!
If your input is only categorical variables (and having a few values each), then there are finitely many potential samples. Therefore, the model will have finitely many inputs, and, therefore, only a few outputs. Just warning.
If you use, say, KNN or random forest, you can assign L2 norm as your accuracy metric. It will emphasize that 1 is closer to 2 than 5 (pls don't forget to normalize).

Is it possible to train the sentiment classification model with the labeled data and then use it to predict sentiment on data that is not labeled?

I want to do sentiment analysis using machine learning (text classification) approach. For example nltk Naive Bayes Classifier.
But the issue is that a small amount of my data is labeled. (For example, 100 articles are labeled positive or negative) and 500 articles are not labeled.
I was thinking that I train the classifier with labeled data and then try to predict sentiments of unlabeled data.
Is it possible?
I am a beginner in machine learning and don't know much about it.
I am using python 3.7.
Thank you in advance.
Is it possible to train the sentiment classification model with the labeled data and then use it to predict sentiment on data that is not labeled?
Yes. This is basically the definition of what supervised learning is.
I.e. you train on data that has labels, so that you can then put it into production on categorizing your data that does not have labels.
(Any book on supervised learning will have code examples.)
I wonder if your question might really be: can I use supervised learning to make a model, assign labels to another 500 articles, then do further machine learning on all 600 articles? Well the answer is still yes, but the quality will fall somewhere between these two extremes:
Assign random labels to the 500. Bad results.
Get a domain expert assign correct labels to those 500. Good results.
Your model could fall anywhere between those two extremes. It is useful to know where it is, so know if it is worth using the data. You can get an estimate of that by taking a sample, say 25 records, and have them also assigned by a domain expert. If all 25 match, there is a reasonable chance your other 475 records also have been given good labels. If e.g. only 10 of the 25 match, the model is much closer to the random end of the spectrum, and using the other 475 records is probably a bad idea.
("10", "25", etc. are arbitrary examples; choose based on the number of different labels, and your desired confidence in the results.)

What is the best way to represent a collection of documents in a fixed length vector?

I am trying to build a deep neural networks that takes in a set of documents and predicts the category it belongs.
Since number of documents in each collection is not fixed, my first attempt was to get a mapping of documents from doc2vec and use the average.
The accuracy on training is high as 90% but the testing accuracy is low as 60%.
Is there a better way of representing a collection of documents as a fixed length vector so that the words they have in common are captured?
The description of your process so far is a bit vague and unclear – you may want to add more detail to your question.
Typically, Doc2Vec would convert each doc to a vector, not "a collection of documents".
If you did try to collapse a collection into a single vector – for example, by averaging many doc-vecs, or calculating a vector for a synthetic document with all the sub-documents' words – you might be losing valuable higher-dimensional structure.
To "predict the category" would be a typical "classification" problem, and with a bunch of documents (represented by their per-doc vectors) and known-labels, you could try various kinds of classifiers.
I suspect from your description, that you may just be collapsing a category to a single vector, then classifying new documents by checking which existing category-vector they're closest-to. That can work – it's vaguely a K-Nearest-Neighbors approach, but with every category reduced to one summary vector rather than the full set of known examples, and each classification being made by looking at just one single nearest-neighbor. That forces a simplicity on the process that may not match the "shapes" of the real categories as well as a true KNN classifier, or other classifiers, could achieve.
If accuracy on test data falls far below that observed during training, that can indicate that significant "overfitting" is occurring: the model(s) are essentially memorizing idiosyncrasies of the training data to "cheat" at answers based on arbitrary correlations, rather than learning generalizable rules. Making your model(s) smaller – such as by decreasing the dimensionality of your doc-vectors – may help in such situations, by giving the model less extra state in which to remember peculiarities of the training data. More data can also help - as the "noise" in more numerous varied examples tends of cancel itself out, rather than achieve the sort of misguided importance that can be learned in smaller datasets.
There are other ways to convert a variable-length text into a fixed-length vector, including many based on deeper learning algorithms. But, those can be even more training-data-hungry, and it seems like you may have other factors to improve before trying those in-lieu-of Doc2Vec.

training small amount of data on the large capacity network

Currently I am using the convolutional neural networks to solve the binary classification problem. The data I use is 2D-images and the number of training data is only about 20,000-30,000. In deep learning, it is generally known that overfitting problems can arise if the model is too complex relative to the amount of the training data. So, to prevent overfitting, the simplified model or transfer learning is used.
Previous developers in the same field did not use high-capacity models (high-capacity means a large number of model parameters) due to the small amount of training data. Most of them used small-capacity models and transfer learning.
But, when I was trying to train the data on high-capacity models (based on ResNet50, InceptionV3, DenseNet101) from scratch, which have about 10 million to 20 million parameters in, I got a high accuracy in the test set.
(Note that the training set and the test set were exclusively separated, and I used early stopping to prevent overfitting)
In the ImageNet image classification task, the training data is about 10 million. So, I also think that the amount of my training data is very small compared to the model capacity.
Here I have two questions.
1) Even though I got high accuracy, is there any reason why I should not use a small amount of data on the high-capacity model?
2) Why does it perform well? Even if there is a (very) large gap between the amount of data and the number of model parameters, the techniques like early stopping overcome the problems?
1) You're completely right that small amounts of training data can be problematic when working with a large model. Given that your ultimate goal is to achieve a "high accuracy" this theoretical limitation shouldn't bother you too much if the practical performance is satisfactory for you. Of course, you might always do better but I don't see a problem with your workflow if the score on the test data is legit and you're happy with it.
2) First of all, I believe ImageNet consists of 1.X million images so that puts you a little closer in terms of data. Here are a few ideas I can think of:
Your problem is easier to solve than ImageNet
You use image augmentation to synthetically increase your image data
Your test data is very similar to the training data
Also, don't forget that 30,000 samples means (30,000 * 224 * 224 * 3 =) 4.5 billion values. That should make it quite hard for a 10 million parameter network to simply memorize your data.
3) Welcome to StackOverflow