Best Neural Network architecture for traditional large multiclass classification problem - deep-learning

I am new to deep learning (I just finished reading Deep Learning with PyTorch), and I was wondering what the best neural network architecture is for my case.
I have a large multiclass classification problem (user identification problem), about 1000 classes in which each class is a user. I have about 2000 features for each user after one-hot encoding and cleaning. Data are highly imbalanced, but I can always use oversampling/downsampling techniques.
I was wondering what is the best architecture to implement for my case. I've always seen deep learning applied to time series or images, so I'm not sure about what to use in this case. I was thinking about a multi-layer perceptron but maybe there are better solutions.
Thanks for your tips and help. Have a nice day!

You can try triplet learning instead of simple classification.
From your 1000 users, you can make c * 1000 * 999 / 2 pairs, where c is the average number of samples per class/user.
https://arxiv.org/pdf/1412.6622.pdf
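To make the idea concrete, here is a minimal sketch of a margin-based triplet loss in plain Python. This is the hinge form commonly used for triplet training and not necessarily the exact loss of the linked paper; the 2-D embeddings and the margin of 1.0 are made-up values for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Hypothetical embeddings: anchor and positive come from the same user,
# negative comes from a different user.
anchor, positive, negative = [0.0, 0.0], [0.0, 1.0], [3.0, 4.0]

loss_easy = triplet_loss(anchor, positive, negative)  # negative is already far away
loss_hard = triplet_loss(anchor, negative, positive)  # roles swapped: margin is violated
```

In a real setup the embeddings would come from the network (for example an MLP over your 2000 features), and the loss would be minimized over many sampled triplets.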

Related

Keras Applications (Transfer Learning)

I am a student and currently studying deep learning by myself. I would like to ask for clarification regarding transfer learning.
For example, with MobileNetV2 (https://keras.io/api/applications/mobilenet/#mobilenetv2-function), if the weights parameter is set to None, then I am not doing transfer learning, because the weights are randomly initialized. If I would like to do transfer learning, then I should set the weights parameter to "imagenet". Is this understanding correct?
Yes, when you initialize the weights with random values, you are just using the architecture and training the model from scratch. The goal of transfer learning is to use the previously gained knowledge by another trained model to get better results or to use less computational resources.
There are different ways to use transfer learning:
You can freeze the learned weights of the base model, replace the last layer based on your problem, and train only that last layer.
You can start with the learned weights and fine-tune them (let them change in the learning process). Many people do that because sometimes it makes the training faster and gives better results because the weights already contain so much information.
You can use the first layers to extract basic features like colors, edges, circles... and add your desired layers after them. In this way, you can use your resources to learn high-level features.
There are more cases, but I hope this gives you an idea.
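The difference between freezing and fine-tuning can be sketched with a toy scalar model in plain Python. This is not Keras code: base_w only stands in for pretrained weights, head_w for the new last layer, and the data, learning rate, and step count are invented for illustration.

```python
def train(xs, ys, base_w, head_w, freeze_base=True, lr=0.01, steps=500):
    """Fit y ~ head_w * (base_w * x) by gradient descent on mean squared error."""
    n = len(xs)
    for _ in range(steps):
        grad_base = grad_head = 0.0
        for x, y in zip(xs, ys):
            err = head_w * base_w * x - y
            grad_head += 2 * err * base_w * x / n
            grad_base += 2 * err * head_w * x / n
        head_w -= lr * grad_head
        if not freeze_base:  # fine-tuning: the "pretrained" weight is allowed to move
            base_w -= lr * grad_base
    return base_w, head_w

xs, ys = [1.0, 2.0, 3.0], [6.0, 12.0, 18.0]
pretrained_w = 2.0  # plays the role of weights loaded from a trained model

# Case 1: frozen base, only the new head is trained.
frozen_base, frozen_head = train(xs, ys, pretrained_w, head_w=0.0)

# Case 2: start from the pretrained value and fine-tune everything.
tuned_base, tuned_head = train(xs, ys, pretrained_w, head_w=0.0, freeze_base=False)
```

In both cases the model fits the data, but only in the second case does the base weight drift away from its pretrained value.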

Train a reinforcement learning model with a large amount of images

I am trying to train a deep reinforcement learning model on a maze-escaping task. Each time, it takes one image as the input (e.g., a different "maze").
Suppose I have about 10K different maze images. The ideal case is that after training on N mazes, my model can quickly solve the puzzle in the remaining 10K - N images.
I am looking for good ideas or empirical evidence on how to select a good N for the training task.
And in general, how should I estimate and improve the "transfer learning" ability of my reinforcement learning model, i.e., make it generalize better?
Any advice or suggestions would be very much appreciated. Thanks.
Firstly, I strongly recommend using 2D arrays for the maps of the mazes instead of images. It would do your model a huge favor, because the features are already extracted. Try using 2D arrays in which walls are represented by ones on a background of zeros.
As for finding the optimal N:
Your model architecture is far more important than the share of the data used for training or the batch size. It's better to design the model well first and then find the optimal N by testing different values of N (because it is only one variable, you can easily tune it yourself).
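Both suggestions can be sketched in a few lines of plain Python. The "#" wall character, the helper names, and the candidate values of N are assumptions for illustration; the training and evaluation steps themselves are not shown.

```python
import random

def maze_to_grid(rows, wall="#"):
    """Convert text rows into a 2D list: 1 for a wall cell, 0 for open ground."""
    return [[1 if ch == wall else 0 for ch in row] for row in rows]

def split_mazes(mazes, n_train, seed=0):
    """Shuffle once, then use the first n_train mazes for training and the rest for evaluation."""
    shuffled = list(mazes)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

grid = maze_to_grid(["#####",
                     "#   #",
                     "# # #",
                     "#####"])

maze_ids = list(range(10_000))      # placeholders for the 10K mazes
for n in (1_000, 2_000, 5_000):     # candidate values of N to compare
    train_set, held_out = split_mazes(maze_ids, n)
    # Train on train_set, then measure the solve rate on held_out (not shown).
```

Plotting the held-out solve rate against N should show where adding more training mazes stops helping.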

Mini-batches in RL

I just read the paper of Mnih (2013) and was really wondering about the aspect that he talks about using RMSprop with minibatches of size 32 (page 6).
My understanding of these kinds of reinforcement learning algorithms is that there is only one, or at least very few, training samples per fit, and that I update the network in every fit.
In supervised learning, by contrast, I have up to millions of samples, divide them into minibatches of e.g. 32, and update the network after every minibatch, which makes sense.
So my question is: if I put only one sample into the neural network at a time, how do minibatches make sense? Did I misunderstand something about that concept?
Thanks in advance!
The answer provided by Filip is correct. Just to add intuition to his answer, the reason why experience replay is used is to decorrelate the experiences that the RL agent collects. This is essential when a non-linear function approximator such as a neural network is used.
Example: Imagine you had 10 days to study for a chemistry test and a math test, and both tests were on the same day. If you spent the first 5 days on chemistry and the last 5 days on math, you would have forgotten most of the chemistry you studied. A neural network behaves similarly.
By decorrelating the experiences, a more general policy can be identified through the training data.
And while training the neural network, we have a batch of memory (i.e., data), and we sample random mini-batches of 32 from them to do supervised learning, just as any other neural network is trained.
The paper you mentioned introduces two mechanisms that stabilize the Q-learning method when it is used with a deep neural network function approximator. One of the mechanisms is called Experience Replay, and it is basically a memory buffer for observed experiences. You can find the description at the end of the fourth page of the paper. Instead of learning from the single experience you have just seen, you save it to the buffer. Learning is done every N iterations, and each time you sample a random minibatch of experiences from the replay buffer.
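A minimal sketch of such a replay buffer in plain Python is shown below. The class name, the capacity of 1000, and the dummy transitions are assumptions for illustration; only the minibatch size of 32 comes from the paper discussed above.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity memory of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # oldest experiences are dropped first

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up consecutive, correlated steps.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=1000)
for t in range(1500):  # dummy transitions standing in for real environment steps
    buffer.push(state=t, action=t % 4, reward=0.0, next_state=t + 1, done=False)

batch = buffer.sample(32)  # one minibatch for a single gradient update
```

So the agent still observes one transition at a time, but every network update is computed on 32 transitions drawn from the stored history.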

Overview for Deep Learning Networks

I am fairly new to Deep Learning and get quite overwhelmed by the many different nets and their fields of application. Thus, I want to know if there is some kind of overview of which kinds of networks exist, what their key features are, and what purpose they serve.
For example, I know about LeNet, ConvNet, AlexNet - and somehow they are the same but still differ?
There are basically two ways to train neural networks: supervised and unsupervised learning. Both need a training set to "learn". Imagine the training set as a massive book from which you can learn specific information. In supervised learning, the book comes with an answer key but without the solution manual; in unsupervised learning, it comes with neither. The goal is the same in both cases: to find patterns, between the questions and answers in supervised learning, and among the questions alone in unsupervised learning.
Now that we have differentiated between those two, we can go into the models. Let's discuss supervised learning, which basically has 3 main models:
artificial neural network (ANN)
convolutional neural network (CNN)
recurrent neural network (RNN)
ANN is the simplest of the three. I believe you already understand it, so we can move on to CNN.
In a CNN, you convolve the input with feature detectors (also called filters or kernels). Together, the feature detectors form an array of dimension (rows, columns, depth), where depth is the number of feature detectors. The goal of convolving the input is to extract information related to spatial structure. Let's say you want to distinguish between cats and dogs: they differ in features such as the shape of the ears, eyes, and snout. The downside is that more convolution layers result in slower computation. To mitigate that, we apply a step called pooling or downsampling, which reduces the size of the feature maps while losing as little information as possible. The next step is flattening all those 3D arrays into a vector of dimension (n, 1) so that you can feed it into an ANN. What follows is an ordinary ANN. Because a CNN is inherently able to detect such features, it is mostly used for classification, for example image classification, time series classification, or even video classification. For a crash course in CNNs, check out this video by Siraj Raval. He's my favourite YouTuber of all time!
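The convolve, pool, and flatten steps can be sketched in plain Python on a tiny example. The 5x5 image and the 2x2 vertical-edge detector are made-up values, and a real CNN would learn its detectors rather than have them written by hand.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation (the operation deep learning libraries call convolution)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

def flatten(fmap):
    """Turn a 2D feature map into a flat vector."""
    return [v for row in fmap for v in row]

# Dark left half, bright right half: one vertical edge.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
vertical_edge = [[-1, 1],
                 [-1, 1]]

feature_map = conv2d(image, vertical_edge)  # 4x4 map, responds only at the edge
pooled = max_pool(feature_map)              # 2x2 map
features = flatten(pooled)                  # vector that would be fed to the ANN
```

The detector fires only where the image goes from dark to bright, and pooling keeps that response while shrinking the map from 4x4 to 2x2.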
Arguably the most sophisticated of the three, an RNN is best described as a neural network that has "memory": it contains "loops" that allow information to persist. Why is this important? As you read this, your brain uses what it has already read to comprehend the rest. You don't rethink everything from scratch at every word, but that is what traditional neural networks do: they forget everything and start again. Plain RNNs aren't very effective, though, so when people talk about RNNs they mostly mean LSTM, which stands for Long Short-Term Memory. If that seems confusing to you, Christopher Olah gives an in-depth explanation in a very simple way. I advise you to check out his link for a complete understanding of how RNNs, especially the LSTM variant, work.
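The "loop" can be illustrated with a one-unit plain RNN in Python. The weights below are arbitrary values chosen for illustration, and this is the simple recurrent cell, not an LSTM.

```python
import math

def rnn_forward(inputs, w_x=1.0, w_h=0.5, b=0.0):
    """Run a one-unit plain RNN: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)  # the loop: h feeds back into itself
        states.append(h)
    return states

with_history = rnn_forward([1.0, 0.0, 0.0])     # a signal at the first step only
without_history = rnn_forward([0.0, 0.0, 0.0])  # no signal at all
```

Both sequences end with the same two zero inputs, yet the final hidden state differs, because the first sequence still carries a fading trace of its first input.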
As for unsupervised learning, I'm so sorry that I haven't got the time to learn them, so this is the best I can do. Good luck and have fun!
They are the same type of network: convolutional neural networks. The problem with such an overview is that as soon as you post something, it is already outdated. Most of the networks you mention are already considered old, even though they are only a few years old.
Nevertheless you can take a look at the networks supplied by caffe (https://github.com/BVLC/caffe/tree/master/models).
In my personal view the most important concepts in deep Learning are recurrent networks (https://keras.io/layers/recurrent/), residual connections, inception blocks (see https://arxiv.org/abs/1602.07261). The rest are largely theoretical concepts, which would not fit in a stack overflow answer.

Calculating ROC and AUC in Caffe?

I have trained a model on ImageNet in Caffe. Now I am trying to calculate ROC/AUC for my model and for the trained model provided by Caffe. I have two questions:
1) ROC/AUC is mainly used for binary classification, but I also found that in some cases people have used it for multi-class problems. Is it possible for 1000 classes? And what would its impact be? In the discussions I have read, people didn't give a good answer about ROC/AUC in multi-class problems.
2) If it is possible, and comparing two models based on ROC/AUC is a good idea, can anybody tell me how to do it for these 1000 classes in Caffe? And do I have to retrain the models from scratch, or can I calculate it with the final trained models only?
Regards
This discussion addresses multi-class ROC/AUC analysis nicely. Answering your questions:
You can do multiple one-vs-all classifications for each class, thus building multiple ROC curves.
Having computed 1000 AUC values, you can come up with the mean AUC over all classes and use this metric to compare goodness of your models. No, you don't need to retrain your models.
Also, note that ROC/AUC metrics are quite specific and are used mostly in detection and biometric tasks such as voice identification.
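The one-vs-all procedure and the mean AUC can be sketched in plain Python as follows. This is not Caffe code: the function names and the 3-class probabilities are made up, and the inputs are assumed to be class scores already saved from a trained model, so no retraining is involved.

```python
def binary_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(prob_rows, true_classes, num_classes):
    """Mean of the per-class one-vs-all AUCs; classes with no positives or no negatives are skipped."""
    aucs = []
    for c in range(num_classes):
        labels = [1 if t == c else 0 for t in true_classes]
        if 0 < sum(labels) < len(labels):
            aucs.append(binary_auc([row[c] for row in prob_rows], labels))
    return sum(aucs) / len(aucs)

# One row of predicted class probabilities per sample (hypothetical model output).
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.2, 0.3, 0.5],
         [0.2, 0.7, 0.1]]
truth = [0, 1, 2, 0]

mean_auc = macro_ovr_auc(probs, truth, num_classes=3)
```

The pairwise comparison is quadratic in the number of samples, so for 1000 classes and a large validation set a rank-based AUC implementation would be the more practical choice.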