Medical Image Segmentation / Image Segmentation - deep-learning

Total dataset: 100 (at the case level)
Training: 76 cases (18000 slices)
Validation: 19 cases (4000 slices)
Test: 5 cases (2000 slices)
I have a dataset of approximately 18,000 images, of which about 15,000 come from normal patients and around 3,000 from patients with some disease. For these 18,000 images I also have segmentation masks, so 15,000 masks are empty and 3,000 contain patches.
Should I also feed my model (deep learning, i.e. a U-Net with a ResNet34 backbone) the empty masks along with the ones containing patches?
EDIT: the patches are very small. Currently I'm getting 57% IoU on validation; when I test on the test cases, I get over-segmentation.
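For illustration, here is a minimal sketch (PyTorch) of one way to keep feeding both empty and lesion masks while controlling how often lesion slices appear in a batch; the 5:1 weighting and the `slice_dataset` name are hypothetical assumptions, not taken from the question:

import torch
from torch.utils.data import WeightedRandomSampler  # a DataLoader would also be needed

# per-slice flag: 1 if the mask contains a lesion patch, 0 if it is empty
# (counts mirror the ~3000 lesion / ~15000 empty split described above)
has_lesion = torch.cat([torch.ones(3000), torch.zeros(15000)])

# lesion slices get 5x the sampling weight of empty slices (a tunable assumption)
weights = 1.0 + 4.0 * has_lesion

sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
# loader = DataLoader(slice_dataset, batch_size=16, sampler=sampler)  # slice_dataset is hypothetical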

Related

LSTM: Different y size after splitting X into timesteps

After following some tutorials on LSTM networks, I've decided to put my knowledge into practice by training an LSTM model on my own dataset.
Here is a view of my data:
As you can observe, I have the same number of samples and labels.
Let's say that I have 10 samples with 10 labels, and I want to split those samples into sequences of 2 timesteps.
After splitting I would have 5 samples, each with 2 timesteps, but I would still have 10 labels.
Am I right?
How do you deal with this problem?
If I try to feed the data in this form, I get a "Data cardinality is ambiguous" exception.
In an LSTM, every input sequence has one label (in the case of simple classification, at least). So in your case each sample would consist of two timesteps of the position data, paired with a single label.
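For example, here is a minimal NumPy sketch of that reshape, under the assumption that each sequence keeps the label of its last timestep (that choice is mine; you could aggregate the labels differently):

# Toy reshape: 10 samples with 3 features each -> 5 sequences of 2 timesteps.
# Keeping the label of the last timestep per sequence is an assumption.
import numpy as np

num_samples, num_features, timesteps = 10, 3, 2
X = np.arange(num_samples * num_features, dtype=float).reshape(num_samples, num_features)
y = np.arange(num_samples)  # one label per original sample

X_seq = X.reshape(num_samples // timesteps, timesteps, num_features)  # (5, 2, 3)
y_seq = y[timesteps - 1::timesteps]                                   # (5,)

print(X_seq.shape, y_seq.shape)  # (5, 2, 3) (5,)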

Question on the kernel dimensions for convolutions on mel filter bank features

I am currently trying to understand the following paper: https://arxiv.org/pdf/1703.08581.pdf. I am struggling to understand a part about how a convolution is performed on an input of log mel filterbank features:
We train seq2seq models for both end-to-end speech translation, and a baseline model for speech recognition. We found
that the same architecture, a variation of that from [10], works
well for both tasks. We use 80 channel log mel filterbank features extracted from 25ms windows with a hop size of 10ms,
stacked with delta and delta-delta features. The output softmax
of all models predicts one of 90 symbols, described in detail in
Section 4, that includes English and Spanish lowercase letters.
The encoder is composed of a total of 8 layers. The input
features are organized as a T × 80 × 3 tensor, i.e. raw features,
deltas, and delta-deltas are concatenated along the ’depth’ dimension. This is passed into a stack of two convolutional layers
with ReLU activations, each consisting of 32 kernels with shape
3 × 3 × depth in time × frequency. These are both strided by
2 × 2, downsampling the sequence in time by a total factor of 4,
decreasing the computation performed in the following layers.
Batch normalization [26] is applied after each layer.
As I understand it, the input to the convolutional layer is three-dimensional: number of 25 ms windows (T) x 80 (features for each window) x 3 (features, delta features and delta-delta features). However, the kernels used on those inputs seem to have 4 dimensions, and I do not understand why that is. Wouldn't a 4-dimensional kernel need a 4-dimensional input? In my head, the input has the same dimensions as an RGB picture: width (time) x height (frequency) x colour channels (features, delta features and delta-delta features). Therefore I would think of a kernel for a 2D convolution as a filter of size a (filter width) x b (filter height) x 3 (depth of the input). Am I missing something here? What is wrong with my idea, or what is done differently in this paper?
Thanks in advance for your answer!
I figured it out; it turned out to be a misunderstanding on my side: the authors use 32 kernels of shape 3x3, which results (after two layers with 2x2 striding) in an output of shape T/4 x 20 x 32, where T stands for the time dimension.
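To sanity-check those shapes, here is a small Keras sketch (my own reconstruction of the two strided convolution layers, not the authors' code):

# Input: T frames x 80 mel channels x 3 "depth" channels (static, delta, delta-delta).
# Two 3x3 convolutions with 32 filters each, strided 2x2, as described in the paper.
import tensorflow as tf

T = 100  # example number of frames
inputs = tf.keras.Input(shape=(T, 80, 3))
x = tf.keras.layers.Conv2D(32, kernel_size=(3, 3), strides=(2, 2),
                           padding="same", activation="relu")(inputs)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Conv2D(32, kernel_size=(3, 3), strides=(2, 2),
                           padding="same", activation="relu")(x)
x = tf.keras.layers.BatchNormalization()(x)
model = tf.keras.Model(inputs, x)
print(model.output_shape)  # (None, 25, 20, 32) -> (T/4, 20, 32)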

Convergence failure while training GAN for 128x128 images

Thanks for looking at this question!
I attempted to train a simple DCGAN to generate room designs from a dataset of 215 coloured images of size 128x128. My attempt can be summarised as below:
Generator: 5 deconvolution layers from (100x1) noise input to (128x128x1) grayscale image output
Discriminator: 4 convolution layers from (128x128x1) grayscale image input
Optimizer: Adam at learning rate of 0.002 for both Generator and Discriminator
Batch size: 21 images/batch
Epoch: 100 epochs with 10 batches/epoch
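For concreteness, here is a rough Keras sketch of a generator matching the description above; the channel counts and kernel sizes are my assumptions, not taken from the actual model:

# Hypothetical generator sketch: 100-dim noise -> 128x128x1 image via five
# transposed convolutions. Channel counts and kernel sizes are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def build_generator(latent_dim=100):
    return tf.keras.Sequential([
        layers.Input(shape=(latent_dim,)),
        layers.Dense(4 * 4 * 512, activation="relu"),
        layers.Reshape((4, 4, 512)),                                                   # 4x4
        layers.Conv2DTranspose(256, 4, strides=2, padding="same", activation="relu"),  # 8x8
        layers.Conv2DTranspose(128, 4, strides=2, padding="same", activation="relu"),  # 16x16
        layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu"),   # 32x32
        layers.Conv2DTranspose(32, 4, strides=2, padding="same", activation="relu"),   # 64x64
        layers.Conv2DTranspose(1, 4, strides=2, padding="same", activation="tanh"),    # 128x128x1
    ])

print(build_generator().output_shape)  # (None, 128, 128, 1)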
Results:
1. D-loss is close to 0 and G-loss is close to 1. After that, I cut my discriminator down by 2 convolution layers and reduced the Adam learning rate to 0.00002, hoping that the discriminator wouldn't overpower my generator.
2. After (1), D-loss and G-loss hover around 0.5-1.0. However, the generated images still look like noise, even after 100 epochs.
Questions:
Is there something wrong in terms of how I trained my GAN?
How should I modify my approach to successfully train the GAN?
Thank you so much, everyone, for your help; really looking forward to your replies!

My ResNet-32, VGG16, VGG19 and DenseNet do not converge

I have an interesting problem. I am working on a project in which I am trying to classify 15 logo classes (14 logos + 1 non-logo class). The dataset is our own. I am using DIGITS 5/6, which employs Caffe; my Caffe is NVIDIA's 0.15.14 flavour.
I have trained with AlexNet and GoogLeNet, which ship with DIGITS. The models built both from scratch and by fine-tuning seem OK (GoogLeNet: 90% accuracy, AlexNet: 80%); the fine-tuned versions were created from the pretrained ImageNet models.
My problem is that I wanted to extend the study to cover ResNet-32, DenseNet-121 and VGG16/19. Whenever I train these models, their top-1 accuracies are very poor (generally 0). You might guess (as I did) that this stems from building from scratch. However, as far as I know the model should converge towards some limit, yet I always see a flat line after 2-3 epochs (the accuracy line, which is generally 0), and the loss value increases to 87 after a few epochs.
I have searched for the possible reasons and tried the following:
1. I changed the weight_filler param to "xavier" (a pycaffe sketch of this follows the list), but nothing changed.
2. I increased the learning rate, but nothing changed.
3. I even used a pretrained model to fine-tune VGG16, but it is still the same.
4. I used the CIFAR-10 dataset, upscaled to 224x224, and tried again, but the results are very similar to those on my logo dataset.
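For reference, here is a hypothetical pycaffe (NetSpec) snippet showing what item 1 looks like, i.e. a convolution layer with the "xavier" weight filler; the layer names and sizes are made up and not taken from the attached prototxt:

# Hypothetical pycaffe sketch: one convolution layer with the "xavier" filler.
import caffe
from caffe import layers as L, params as P

n = caffe.NetSpec()
n.data, n.label = L.Data(batch_size=32, backend=P.Data.LMDB,
                         source='train_lmdb', ntop=2)
n.conv1 = L.Convolution(n.data, kernel_size=3, pad=1, num_output=64,
                        weight_filler=dict(type='xavier'),
                        bias_filler=dict(type='constant', value=0))
n.relu1 = L.ReLU(n.conv1, in_place=True)
print(str(n.to_proto()))  # emits the corresponding prototxt fragment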
I am struggling to find the correct way forward. I am not an expert, but it seems so odd to me to get such bad results after getting nice ones with AlexNet and GoogLeNet.
Why do my models not converge on these more recent networks? I need your advice.
By the way, my training data contains 400 images per logo class, and for the non-logo class I have collected 1200 non-logo images. The validation data contains a different number of images per logo class and a different set of 1200 non-logo images. So in total I have 5204 training, 579 test (10% of training) and 4729 validation images.
Here I am attaching the trainval.txt for my ResNet-32 model.
So what is my problem?
Thanks in advance
resnet32_train_val.prototxt

How does testing work in caffe framework?

So basically one splits the database into training and testing sets; let's say 2/3 for training and the rest set aside for testing.
Then in Caffe we split our training data into batches; let's say we have 100 batches of 50 images each, so we have 5000 training images. Now let's say that we also have 50 testing batches of 50 images each.
Now let's say that Caffe completed 1 epoch and then tests with the testing batches. How does Caffe do this?
Does it take the first training batch and, with it, try to predict the labels of every testing batch?
Like:
training_batch_1 : testing_batch_1 = accuracy xxxx;
training_batch_1 : testing_batch_2 = accuracy xxxx;
....
training_batch_1 : testing_batch_50 = accuracy xxxx;
And then it extracts the mean accuracy for training_batch_1, then does the same thing with training_batch_2, and so on?
A test simply runs the input vector through a single forward pass of the trained model. Does the top predicted label match the given test value? If so, score 1 point. At the end of the batch, divide total points by batch size, and that's the batch accuracy.
At the end of the testing run, take the mean of the batch accuracies; that's the testing accuracy.
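As a toy illustration (plain NumPy with random stand-in predictions, not actual Caffe code), the scoring works like this:

# Score 1 point per correct top-1 prediction, divide by batch size for the
# batch accuracy, then average the batch accuracies over the test run.
import numpy as np

rng = np.random.default_rng(0)
num_batches, batch_size, num_classes = 50, 50, 10

batch_accuracies = []
for _ in range(num_batches):
    true_labels = rng.integers(0, num_classes, size=batch_size)
    predicted_labels = rng.integers(0, num_classes, size=batch_size)  # stand-in for a forward pass
    points = np.sum(predicted_labels == true_labels)   # 1 point per correct top-1 prediction
    batch_accuracies.append(points / batch_size)       # batch accuracy

test_accuracy = float(np.mean(batch_accuracies))       # mean over all test batches
print(test_accuracy)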
Is that what you needed to know?