Transfer learning from image classifiers (e.g. AlexNet) to binary classification of generated data charts - deep-learning

I have written code that transforms time series data (medical EEGs) into a scatter chart. I have labels available for a training set noting the presence or absence of 'seizures' in the data. I can see the seizures clearly (as a human) in the scatter plots. I thought it would be straightforward to adapt a pre-trained image classifier (AlexNet) to do the (seizure, no_seizure) binary classification. I have a training set of 500 chart images. The model is not converging.
I replaced the final AlexNet layer before training:
model.classifier[6] = torch.nn.Linear(model.classifier[6].in_features, 2)
Do you have any advice for this challenge? Intuitively, I thought classification of scatter charts would be easier than classification of photographs.
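For reference, a minimal sketch of the fine-tuning setup the question describes (the frozen feature extractor, ImageNet normalization, optimizer, and learning rate are assumptions, not from the original post; it also assumes torchvision >= 0.13 for the weights argument):

import torch
import torchvision

# Load a pretrained AlexNet and adapt it for 2 classes.
model = torchvision.models.alexnet(weights=torchvision.models.AlexNet_Weights.IMAGENET1K_V1)

# Optionally freeze the convolutional feature extractor so only the new head is trained;
# with only 500 training images this often helps convergence.
for param in model.features.parameters():
    param.requires_grad = False

# Replace the final classification layer, as in the question.
model.classifier[6] = torch.nn.Linear(model.classifier[6].in_features, 2)

# AlexNet expects 224x224 inputs normalized with ImageNet statistics.
preprocess = torchvision.transforms.Compose([
    torchvision.transforms.Resize((224, 224)),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

optimizer = torch.optim.Adam(model.classifier[6].parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()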

Related

How can you increase the accuracy of ResNet50?

I'm using the ResNet50 model to classify images into two classes: normal cells and cancer cells.
I want to increase the accuracy but I don't know what to modify.
# We are using ResNet50 for transfer learning here, so we import it
from tensorflow.keras import layers
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adamax
from tensorflow.keras.applications import resnet50
# Initializing the model with weights='imagenet', i.e. we are carrying over its original weights
model_name = 'resnet50'
base_model = resnet50.ResNet50(include_top=False, weights="imagenet", input_shape=img_shape, pooling='max')
last_layer = base_model.output  # we take the last layer of the base model
# Add a Flatten layer: we extend the network by flattening the base model output
flatten = layers.Flatten()(last_layer)
# Add a dense layer
dense1 = layers.Dense(100, activation='relu')(flatten)
# Add the final output layer on top of the dense layer
output_layer = layers.Dense(class_count, activation='softmax')(dense1)
# Creating the model with input and output layers
model = Model(inputs=base_model.inputs, outputs=output_layer)
model.compile(Adamax(learning_rate=.001), loss='categorical_crossentropy', metrics=['accuracy'])
There were 48 errors in 534 test cases; model accuracy = 91.01%.
Also, what do you think about the results shown in the graph?
This is the classification report.
I got good results, but is there a possibility to increase the accuracy beyond that?
This is a broad question, as there are many ways one can attempt to improve a network's accuracy. Some of them are (a rough sketch of the augmentation and learning-rate points follows below):
Increase the dimension of the layers that are learned in transfer learning (making sure not to overfit)
Use transfer learning with convolutional layers and not only an MLP head
Let the optimization procedure adapt the learning rate instead of keeping it fixed
Play with additional augmentations of the dataset
and the list goes on.
Also, if possible, I would suggest comparing your results to other publicly available benchmarks; by doing so you might get a better sense of the upper bound on achievable accuracy.
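As a rough illustration of the augmentation and learning-rate points above, a hedged Keras sketch; the augmentation layers, callback settings, and the train_ds / valid_ds dataset names are placeholders, not taken from your code:

import tensorflow as tf
from tensorflow.keras import layers, callbacks

# Example augmentation block applied to the input images (values are illustrative).
augment = tf.keras.Sequential([
    layers.RandomFlip('horizontal'),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
])

# Reduce the learning rate when the validation loss stops improving,
# instead of keeping it fixed at 0.001 for the whole run.
lr_schedule = callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                                          patience=3, min_lr=1e-6)

# train_ds and valid_ds are assumed to be batched tf.data.Dataset objects of (image, label) pairs.
model.fit(train_ds.map(lambda x, y: (augment(x, training=True), y)),
          validation_data=valid_ds,
          epochs=50,
          callbacks=[lr_schedule])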

How to train single-image depth estimation on the KITTI dataset with a masking method

I'm studying deep learning (supervised learning) to estimate depth images from monocular images.
The dataset currently uses KITTI data: the RGB images (the input) come from the KITTI raw data, and the data from the following link is used as ground truth.
While training a model built as a simple encoder-decoder network, the results were not very good, so I am trying various approaches.
While searching for methods, I found that, because the ground truth contains many invalid areas (i.e. values that cannot be used, as shown in the image below), people train only on the valid areas by masking.
So I trained with masking, but I am curious why this result keeps coming out.
This is the training part of my code.
How can I fix this problem?
for epoch in range(num_epoch):
    model.train()  ### train ###
    for batch_idx, samples in enumerate(tqdm(train_loader)):
        x_train = samples['RGB'].to(device)
        y_train = samples['groundtruth'].to(device)
        pred_depth = model.forward(x_train)

        valid_mask = y_train != 0  #### Here is the masking
        valid_gt_depth = y_train[valid_mask]
        valid_pred_depth = pred_depth[valid_mask]

        loss = loss_RMSE(valid_pred_depth, valid_gt_depth)
As far as I can understand, you are trying to estimate depth from an RGB image as input. This is an ill-posed problem, since the same input image can project to multiple plausible depth values. You would need to integrate certain techniques to estimate accurate depth from RGB images, instead of simply taking an L1 or L2 loss between the predicted depth and the corresponding ground-truth depth image.
I would suggest going through some papers on estimating depth from single images, such as Depth Map Prediction from a Single Image using a Multi-Scale Deep Network, where they use a network to first estimate the global structure of the given image and then use a second network that refines the local scene information. Instead of taking a simple RMSE loss, as you did, they use a scale-invariant error function in which the relationship between points is measured.
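For a concrete idea, a minimal PyTorch sketch of that scale-invariant error, kept compatible with the masking in your loop (the eps constant and lam=0.5 weighting are conventional choices, not taken from the paper's code, and it assumes the network predicts strictly positive depths):

import torch

def scale_invariant_loss(pred_depth, gt_depth, valid_mask, lam=0.5, eps=1e-6):
    # Scale-invariant error (Eigen et al., 2014), computed only on valid pixels:
    # d_i = log(pred_i) - log(gt_i), loss = mean(d^2) - lam * mean(d)^2
    d = torch.log(pred_depth[valid_mask] + eps) - torch.log(gt_depth[valid_mask] + eps)
    return (d ** 2).mean() - lam * d.mean() ** 2

# usage inside the training loop shown in the question:
# loss = scale_invariant_loss(pred_depth, y_train, y_train != 0)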

Should we train the original data point when we do data augmentation?

I am confused about the definition of data augmentation. Should we train on both the original data points and the transformed ones, or just on the transformed ones? If we train on both, we increase the size of the dataset, while the second approach won't.
I got this question when using the function RandomResizedCrop.
'train': transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
]),
If we resize and crop some of the dataset randomly, we don't actually increase the size of the dataset for data augmentation. Is that correct? Or does data augmentation just require changing/modifying the original dataset rather than increasing its size?
Thanks.
By definition, or at least according to the influential 2012 AlexNet paper that popularized data augmentation in computer vision, data augmentation increases the size of the training set. Hence the word augmentation. Go ahead and have a look at Section 4.1 of the AlexNet paper. Here is the gist of it, which I'm quoting from the paper:
The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations. The first form of data augmentation consists of generating image translations and horizontal reflections. We do this by extracting random 224 × 224 patches (and their horizontal reflections) from the 256×256 images and training our network on these extracted patches. This increases the size of our training set by a factor of 2048, though the resulting training examples are, of course, highly interdependent.
As for the specific implementation, it depends on your use case and, most importantly, on the size of your training data. If your training set is small, you should consider training on both the original and the transformed images, taking sufficient care that the transformations preserve the labels.
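To connect this to the RandomResizedCrop example: torchvision transforms are applied on the fly, so each epoch sees a differently cropped version of every original image; the dataset length stays the same, even though the model rarely sees the exact same pixels twice. If you explicitly want the originals and the transformed copies in one training set, one possible (hypothetical) way is to concatenate two dataset views; the 'data/train' path and the use of ImageFolder below are placeholders:

import torch
from torchvision import datasets, transforms

normalize = transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])

# View 1: original images, only resized and normalized.
plain = transforms.Compose([transforms.Resize((224, 224)),
                            transforms.ToTensor(), normalize])
# View 2: randomly cropped / flipped versions of the same images.
augmented = transforms.Compose([transforms.RandomResizedCrop(224),
                                transforms.RandomHorizontalFlip(),
                                transforms.ToTensor(), normalize])

# Concatenating the two views doubles the nominal dataset size.
train_set = torch.utils.data.ConcatDataset([
    datasets.ImageFolder('data/train', transform=plain),
    datasets.ImageFolder('data/train', transform=augmented),
])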
transforms.Compose is just preprocessing of an image: it converts the image from one form into a particular suitable form.
Applying a transformation to a single image only changes its pixel values; it does not increase the dataset size.
To get a larger dataset you have to perform operations such as the ones below:
import numpy as np
from tqdm import tqdm
from skimage.transform import rotate
from skimage.util import random_noise

final_train_data = []
final_target_train = []
for i in tqdm(range(train_x.shape[0])):
    # keep the original image plus four transformed copies of it
    final_train_data.append(train_x[i])
    final_train_data.append(rotate(train_x[i], angle=45, mode='wrap'))
    final_train_data.append(np.fliplr(train_x[i]))
    final_train_data.append(np.flipud(train_x[i]))
    final_train_data.append(random_noise(train_x[i], var=0.2**2))
    # replicate the label once for each of the five copies
    for j in range(5):
        final_target_train.append(train_y[i])
For more detail, see the PyTorch documentation.

Where should I put the previous predicted sequence in LSTM for optical character recognition systems

I am trying to build an optical character recognition system that can recognize handwritten sentences using the LSTM cell.
Now what I have understood from literature is that you need to give two inputs to the LSTM cell: one is the image that you are trying to recognize and the second is the sequence of words it has already predicted. So for example if I had an image that read "I love machine learning", I would create the following pairs of input:
Image + startseq
Image + startseq + I
Image + startseq + I + love
So for each input you want the LSTM to predict the next word i.e. I, love, machine for the above sequences.
The problem I'm having is that I can't figure out how to input the image AND the previous sequence to the LSTM cell. Do I divide my image (a 2-D matrix) into row/column vectors and send them to the LSTM one at a time, and after I'm done with that, send in the previous sequence of words? But this way I'll have quite long input sequences, which might lead to long convergence times.
I know image captioning tasks vectorize input images using pretrained neural nets but can that be done for optical character recognition systems, i.e. would that cause accuracy issues?
No, you don't have to feed recognized words back into the LSTM. You only feed an input (feature) sequence, and the LSTM learns to propagate relevant information through this sequence.
You should think in terms of an input sequence and an output sequence when talking about Recurrent Neural Networks (RNNs).
The input to an RNN at time-step t is:
the state of the memory cell at t-1
the input element at t
An LSTM has a more advanced internal structure than a vanilla RNN, which allows more robust training. But from a user's perspective it works just like a vanilla RNN: you input a sequence and the LSTM computes an output sequence for you.
When doing handwriting recognition, you usually extract a feature sequence from the input image (e.g. by using convolutional layers).
Then, you feed this feature sequence into LSTM layers.
You map the output sequence to a character-probability matrix which is then decoded into the final text by the CTC layer.
Here is a short tutorial on how to build a handwriting recognition system; it should give you an idea of which data (see "Data": "CNN output" and "RNN output") flows into the LSTM and which data flows out of it:
https://towardsdatascience.com/2326a3487cd5
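To make that pipeline more concrete, here is a rough, self-contained PyTorch sketch of a CNN + LSTM + CTC model; the layer sizes, the image height of 32 pixels, and the character count of 80 are arbitrary placeholders, not taken from the linked tutorial:

import torch
import torch.nn as nn

class SmallCRNN(nn.Module):
    # The CNN extracts a feature sequence from the image, the LSTM models the sequence,
    # and a linear layer produces per-time-step character scores for CTC.
    def __init__(self, num_chars=80):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.lstm = nn.LSTM(input_size=64 * 8, hidden_size=128,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 128, num_chars + 1)  # +1 for the CTC blank symbol

    def forward(self, x):              # x: (batch, 1, 32, width)
        f = self.cnn(x)                # (batch, 64, 8, width / 4)
        f = f.permute(0, 3, 1, 2)      # time axis = image width
        f = f.flatten(2)               # feature sequence: (batch, width / 4, 512)
        f, _ = self.lstm(f)
        return self.fc(f).log_softmax(2)

model = SmallCRNN()
logits = model(torch.randn(4, 1, 32, 128))       # 4 dummy images, height 32, width 128
log_probs = logits.permute(1, 0, 2)              # CTCLoss expects (time, batch, classes)
targets = torch.randint(1, 80, (4, 10))          # dummy label sequences
loss = nn.CTCLoss(blank=80)(log_probs, targets,
                            input_lengths=torch.full((4,), log_probs.size(0)),
                            target_lengths=torch.full((4,), 10))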

Can I use autoencoder for clustering?

In the code below, they use an autoencoder for supervised clustering or classification because they have data labels.
http://amunategui.github.io/anomaly-detection-h2o/
But can I use an autoencoder to cluster data if I do not have labels?
Regards
The deep-learning autoencoder is always unsupervised learning. The "supervised" part of the article you link to is to evaluate how well it did.
The following example (taken from ch.7 of my book, Practical Machine Learning with H2O, where I try all the H2O unsupervised algorithms on the same data set - please excuse the plug) takes 563 features, and tries to encode them into just two hidden nodes.
m <- h2o.deeplearning(
  2:564, training_frame = tfidf,
  hidden = c(2), autoencoder = T, activation = "Tanh"
)
f <- h2o.deepfeatures(m, tfidf, layer = 1)
The second command there extracts the hidden node activations (the deep features). f is a data frame, with two numeric columns, and one row for every row in the tfidf source data. I chose just two hidden nodes so that I could plot the clusters:
Results will change on each run. You can (maybe) get better results with stacked auto-encoders, or using more hidden nodes (but then you cannot plot them). Here I felt the results were limited by the data.
BTW, I made the above plot with this code:
d <- as.matrix(f[1:30,]) #Just first 30, to avoid over-cluttering
labels <- as.vector(tfidf[1:30, 1])
plot(d, pch = 17) #Triangle
text(d, labels, pos = 3) #pos=3 means above
(P.S. The original data came from Brandon Rose's excellent article on using NLTK. )
In some respects, encoding data and clustering data share overlapping theory. As a result, you can use autoencoders to cluster (encode) data.
A simple example to visualize is if you have a set of training data that you suspect has two primary classes, such as voter history data for Republicans and Democrats. If you take an autoencoder, encode the data into two dimensions, and then plot it on a scatter plot, the clustering becomes clearer. Below is a sample result from one of my models. You can see a noticeable split between the two classes as well as a bit of expected overlap.
The code can be found here
This method does not require only two binary classes; you could also train on as many different classes as you wish. Two polarized classes are just easier to visualize.
This method is not limited to two output dimensions either; that was just for plotting convenience. In fact, you may find it difficult to meaningfully map certain large-dimensional spaces to such a small space.
In cases where the encoded (clustered) layer is larger in dimension, it is not as easy to "visualize" the feature clusters. This is where it gets a bit more difficult, as you'll have to use some form of supervised learning to map the encoded (clustered) features to your training labels.
A couple of ways to determine which class the features belong to are to feed the data into a clustering algorithm such as k-means, or, what I prefer to do, take the encoded vectors and pass them to a standard backpropagation neural network. Note that, depending on your data, you may find that just feeding the data straight into your backpropagation network is sufficient.
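A small Keras sketch of that pipeline, i.e. train an autoencoder, take the encoded vectors, and then hand them to a clustering algorithm; the layer sizes, the choice of k-means with k=2, and the random placeholder data are assumptions for illustration only:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.cluster import KMeans

X = np.random.rand(1000, 563).astype('float32')   # placeholder for a 563-feature dataset

# Autoencoder with a 2-dimensional bottleneck, analogous to the hidden = c(2) example above.
inputs = keras.Input(shape=(563,))
encoded = layers.Dense(64, activation='tanh')(inputs)
encoded = layers.Dense(2, activation='tanh')(encoded)
decoded = layers.Dense(64, activation='tanh')(encoded)
decoded = layers.Dense(563, activation='sigmoid')(decoded)

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(X, X, epochs=20, batch_size=32, verbose=0)   # reconstruct the input; no labels used

# Cluster in the 2-D encoded space; still no labels anywhere in the pipeline.
codes = encoder.predict(X)
clusters = KMeans(n_clusters=2).fit_predict(codes)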