Generalized Dice loss for segmentation in Caffe

I am struggling to implement the generalized Dice loss for Caffe as a Python layer that calculates the loss over sub-volumes, and I am hoping to get some help here. If there is any existing code, please share a link.
I have 5 labels (0: background, labels 1-4 for objects). Since I am extracting patches from 3D data, some of the sub-volumes contain only background. How should the Dice loss be calculated for these sub-volumes?
Why, in this line of code for creating the one-hot label, does the author count the background voxels separately?
Do we calculate the volume overlap for the background voxels too?
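For concreteness, here is a minimal NumPy sketch of the generalized Dice loss as defined by Sudre et al. (2017), which weights each class by the inverse of its squared volume; the layer linked above may differ in details such as how it treats the background class, and the epsilon handling for background-only sub-volumes is my assumption:

```python
import numpy as np

def generalized_dice_loss(probs, labels, num_classes, eps=1e-7):
    """probs: (C, D, H, W) softmax outputs; labels: (D, H, W) integer labels."""
    # One-hot encode the label volume: shape (C, D, H, W).
    onehot = np.stack([(labels == c).astype(np.float64) for c in range(num_classes)])
    probs_flat = probs.reshape(num_classes, -1)
    onehot_flat = onehot.reshape(num_classes, -1)
    # Per-class weights: inverse squared class volume, so small foreground
    # objects are not dominated by the (much larger) background class.
    class_volumes = onehot_flat.sum(axis=1)
    weights = 1.0 / (class_volumes ** 2 + eps)  # eps guards classes absent from this patch
    intersect = (probs_flat * onehot_flat).sum(axis=1)
    union = (probs_flat + onehot_flat).sum(axis=1)
    # For a background-only sub-volume, the foreground terms are near zero and
    # the loss is driven by how well the background is predicted.
    return 1.0 - 2.0 * (weights * intersect).sum() / ((weights * union).sum() + eps)
```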

Related

U-Net: how to understand the cropped output

I'm looking for a U-Net implementation for a landmark detection task, where the architecture is intended to be similar to the figure above. For reference, please see: An Attention-Guided Deep Regression Model for Landmark Detection in Cephalograms.
From the figure, we can see the input dimension is 572x572 but the output dimension is 388x388. My question is, how do we visualize and correctly understand the cropped output? From what I know, we ideally expect the output size to be the same as the input size (572x572), so we can apply the mask to the original image to carry out segmentation. However, in some tutorials (like this one), the author recreates the model from scratch and uses "same" padding to sidestep my question, but I would prefer not to use same padding to achieve the same output size.
I couldn't use same padding because I chose a pretrained ResNet34 as my encoder backbone, and PyTorch's pretrained ResNet34 implementation doesn't use same padding in the encoder, which means the result is exactly like what you see in the figure above (intermediate feature maps are cropped before being copied). If I were to continue building the decoder this way, the output would be smaller than the input image.
The question being: if I want to use the output segmentation maps, should I pad them on the outside until the dimensions match the input, or just resize the maps? I'm worried the first will lose information about the boundary of the image, and the latter will dilate the landmark predictions. Is there a best practice for this?
The reason I must use a pretrained network is that my dataset is small (only 100 images), so I want to make sure the encoder can generate good enough feature maps from what it learned on ImageNet.
After some thinking and testing of my program, I found that PyTorch's pretrained ResNet34 doesn't lose image size through its convolutions; its implementation does in fact use same padding. An illustration:
Input(3,512,512) -> Layer1(64,128,128) -> Layer2(128,64,64) -> Layer3(256,32,32) -> Layer4(512,16,16)
so we can use deconvolution (ConvTranspose2d in PyTorch) to bring the dimension back up to 128, then upsample the result by a factor of 4 to get the segmentation mask (or landmark heatmaps).
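A minimal PyTorch sketch of this decoder, assuming torchvision's resnet34 as the encoder; the skip connections and exact channel counts below are one reasonable choice, not the only one:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class ResNet34UNet(nn.Module):
    def __init__(self, out_channels=1):
        super().__init__()
        enc = resnet34(pretrained=True)  # newer torchvision: weights="IMAGENET1K_V1"
        # Stem downsamples by 4: (3,512,512) -> (64,128,128)
        self.stem = nn.Sequential(enc.conv1, enc.bn1, enc.relu, enc.maxpool, enc.layer1)
        self.layer2 = enc.layer2   # -> (128, 64, 64)
        self.layer3 = enc.layer3   # -> (256, 32, 32)
        self.layer4 = enc.layer4   # -> (512, 16, 16)
        # Decoder: each ConvTranspose2d(k=2, stride=2) exactly doubles the size.
        self.up3 = nn.ConvTranspose2d(512, 256, kernel_size=2, stride=2)        # 16 -> 32
        self.up2 = nn.ConvTranspose2d(256 + 256, 128, kernel_size=2, stride=2)  # 32 -> 64
        self.up1 = nn.ConvTranspose2d(128 + 128, 64, kernel_size=2, stride=2)   # 64 -> 128
        self.head = nn.Conv2d(64 + 64, out_channels, kernel_size=1)
        # Final 4x upsample back to the 512x512 input resolution.
        self.upsample = nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False)

    def forward(self, x):
        s1 = self.stem(x)          # (64, 128, 128)
        s2 = self.layer2(s1)       # (128, 64, 64)
        s3 = self.layer3(s2)       # (256, 32, 32)
        s4 = self.layer4(s3)       # (512, 16, 16)
        d3 = self.up3(s4)                          # (256, 32, 32)
        d2 = self.up2(torch.cat([d3, s3], dim=1))  # (128, 64, 64)
        d1 = self.up1(torch.cat([d2, s2], dim=1))  # (64, 128, 128)
        out = self.head(torch.cat([d1, s1], dim=1))
        return self.upsample(out)                  # (out_channels, 512, 512)
```

Because the encoder preserves spatial alignment (same padding), the skip feature maps can be concatenated without any cropping, so the output matches the input size without padding or resizing.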

Initializing convolutional layers in CNN

Is there a function to initialize weights of a convolutional layer to focus more on information closer to the center of input images?
All my input images are centered, so pixels further away from the center of an image matter less than pixels closer to the center.
Please see the GIFs here for a demonstration of convolutions:
https://github.com/vdumoulin/conv_arithmetic#convolution-animations
As you can see, convolutions operate the same regardless of position in the image, so weight initialization cannot change where the network focuses.
It is also not advisable to rush into assumptions about what the net will and won't need to learn for your task. There are sometimes surprising amounts of signal outside what you as a human might focus on. I would suggest training the net and seeing how it performs, and then (as others have suggested) thinking about cropping.
Is there a function to initialize weights of a convolutional layer to focus more on information closer to the center of input images?
This is not possible, because initialization is there just to kick-start the learning process.
The model, however, is what can contain components that achieve this kind of focus, such as attention.
You also don't need to initialize convolutional layers manually, because PyTorch already does this automatically.
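To make the distinction concrete, here is a hypothetical sketch (not a standard PyTorch facility): a model component that emphasizes central pixels with a fixed Gaussian mask, while the convolutional layer keeps PyTorch's default initialization. The module name and sigma value are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CenterWeighting(nn.Module):
    """Multiplies the input by a fixed 2D Gaussian centered on the image."""
    def __init__(self, height, width, sigma=0.5):
        super().__init__()
        ys = torch.linspace(-1.0, 1.0, height)
        xs = torch.linspace(-1.0, 1.0, width)
        yy, xx = torch.meshgrid(ys, xs, indexing='ij')
        mask = torch.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
        self.register_buffer('mask', mask)  # fixed, not a learnable parameter

    def forward(self, x):          # x: (N, C, H, W)
        return x * self.mask       # broadcasts over batch and channel dims

# The focus lives in the model, not in the conv layer's initialization.
model = nn.Sequential(
    CenterWeighting(224, 224),
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # PyTorch's default init
    nn.ReLU(),
)
```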

Question about training an object detection model when the training images have many missing annotations

In the training dataset, some objects that should have been labeled were not, for various reasons.
As in the picture below, some objects are missing annotations (the red rectangle marks the one that was labeled).
[image with missing annotations]
What should I do with this incompletely labeled dataset, and what is the effect on the model (perhaps the unlabeled objects act as false negatives during training and hurt performance at test time)?
Most detection algorithms use the portions of images without bounding boxes as "negative" examples, meaning regions that should not be detected.
If you have many objects in your training set which should have been labeled but aren't, this is a problem because it confuses the training algorithm.
You should definitely think about manually adding the missing labels to the dataset.
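To see why missing labels hurt, here is a minimal NumPy-style sketch of the IoU-based label assignment most detectors use (the thresholds below are illustrative assumptions): a candidate box sitting on an unlabeled object matches no ground truth and gets trained on as background.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def assign_labels(candidates, gt_boxes, pos_thr=0.5, neg_thr=0.3):
    """1 = positive, 0 = negative (background), -1 = ignored."""
    labels = []
    for cand in candidates:
        best = max((iou(cand, gt) for gt in gt_boxes), default=0.0)
        labels.append(1 if best >= pos_thr else (0 if best < neg_thr else -1))
    return labels

# A candidate lying exactly on an object that was never annotated has IoU 0
# against all ground truth, so it is labeled 0: a false negative signal.
```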

Image from the concept of a convolutional neural network

Suppose I have a CNN trained to classify images of different animals. The output of such a model is a point (the output point) in an n-dimensional space, where n is the number of animal classes the model was trained on; that output is then transformed into a one-hot vector of n parameters, giving the correct label for the image from the CNN's point of view. But let's stick with the n-dimensional point, which is the CNN's concept of the input image.
Suppose then that I want to take that point and transform it so that the final output is an image of fixed width and height (the dimensions should be the same across different input images) that produces the same point as the input image when fed to the CNN. How do I do that?
I'm basically asking about the methods (mostly training methods) used for this kind of task, where an image must be reconstructed from the output point of the CNN. I know the image will never be identical; I'm looking for images that generate the same (or at least not very different) output point as the input image when fed to the CNN. Keep in mind that the input of the model I'm asking for is n-dimensional and the output is a two-dimensional (or three-dimensional, if not grayscale) tensor. I noticed that DeepDream does exactly this kind of thing (I think), but every time I put "deepdream" and "generate" into Google, an online generator is almost always shown, not the actual techniques; so if there are answers to this, I'd love to hear them.
The output label does not contain enough information to reconstruct an entire image.
Quoting from the DeepDream example ipython notebook:
Making the "dream" images is very simple. Essentially it is just a gradient ascent process that tries to maximize the L2 norm of activations of a particular DNN layer.
So the algorithm modifies an existing image such that outputs of certain nodes in the network (can be in an intermediate layer, not necessarily the output nodes) become large. In order to do that, it has to calculate the gradient of the node output with respect to the input pixels.
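A minimal PyTorch sketch of that gradient-ascent idea, assuming a pretrained VGG16 from torchvision; the layer index, learning rate, and step count are arbitrary choices:

```python
import torch
from torchvision.models import vgg16

model = vgg16(pretrained=True).features.eval()  # newer torchvision: weights=...
layer_index = 20                  # which intermediate layer to "dream" on

# Start from noise (or load an existing photo); the image itself is optimized.
image = torch.rand(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    x = image
    for i, layer in enumerate(model):
        x = layer(x)
        if i == layer_index:
            break
    # Gradient *ascent*: minimize the negative L2 norm of the activations.
    loss = -x.norm()
    loss.backward()               # gradient of activations w.r.t. input pixels
    optimizer.step()
```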

Using RCNN to autocrop an image

I am new to machine learning. I've been messing around with NVIDIA DIGITS to train on a new dataset. My model, however, is too inaccurate, and I think it is because there is too much background in the images, so it gets confused about what the actual object is. My question:
Is there a way (possibly using an RCNN) to crop out the background and then train on the cropped images? The object is consistent (e.g. only one object, such as a single person, though there may be people in the background) and always by itself.
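One possible approach, sketched here with torchvision's pretrained Faster R-CNN standing in for a generic R-CNN (the score threshold and fallback behavior are illustrative assumptions): detect the object, crop to the highest-scoring box, and train the classifier on the crops.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = fasterrcnn_resnet50_fpn(pretrained=True).eval()

def autocrop(path, score_threshold=0.8):
    """Return the image cropped to its most confident detection."""
    img = Image.open(path).convert('RGB')
    with torch.no_grad():
        pred = model([to_tensor(img)])[0]   # detections sorted by score
    # Fall back to the full image if nothing clears the threshold.
    if len(pred['scores']) == 0 or pred['scores'][0] < score_threshold:
        return img
    x1, y1, x2, y2 = pred['boxes'][0].round().int().tolist()
    return img.crop((x1, y1, x2, y2))
```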