3D annotation for instance segmentation - deep-learning

I'm trying to annotate some data for 3D instance segmentation. While it's fairly straightforward to draw masks for each 2D plane, it's not obvious how to connect the same "instances" together post-annotation (i.e. connect the "red" masks together, connect the "blue" masks together) without laboriously making sure the instances are instance-matched (i.e. colour-coded so that "red" masks always connect with "red" masks).
A naive approach I have thought of is to make many 2D segmentation masks and calculate the center of mass for each detected object. I can later re-assign the instances based on the closest matching center of mass, but I worry this would inadvertently generate "crossed-over" segmentation instances (illustrated below). What are some high-throughput strategies for generating 3D annotations?
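(For reference, a minimal sketch of that naive centroid-matching idea using SciPy; the function name and distance threshold are illustrative, not from the post. As noted above, nearest-centroid matching can still produce crossed-over instances when objects pass close to each other.)
```python
import numpy as np
from scipy import ndimage as ndi

def link_instances(prev_labels, curr_labels, max_dist=20.0):
    """Relabel instances in the current slice so each takes the id of the
    closest centroid in the previous slice (if one is close enough)."""
    prev_ids = [i for i in np.unique(prev_labels) if i != 0]
    if not prev_ids:                      # nothing in the previous slice to link to
        return curr_labels.copy()
    prev_centroids = ndi.center_of_mass(prev_labels > 0, prev_labels, prev_ids)
    relabelled = np.zeros_like(curr_labels)
    for cid in np.unique(curr_labels):
        if cid == 0:
            continue
        cy, cx = ndi.center_of_mass(curr_labels == cid)
        dists = [np.hypot(cy - py, cx - px) for py, px in prev_centroids]
        if min(dists) < max_dist:
            relabelled[curr_labels == cid] = prev_ids[int(np.argmin(dists))]
        # instances with no close match stay 0 here; in practice you would assign a new id
    return relabelled
```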

The boundaries of your 2D slices could be used as constraints to obtain the optimal 3D surface, as proposed in [1].
However, I think it is easier to generate 3D labels from markers, as in [2]. Its implementation is available here (feel free to open an issue if you encounter any problems :P).
Also, the napari package could be useful to develop the GUI without much effort.
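For example, a minimal napari sketch (the array names are placeholders) that lets you paint a single 3D labels layer, so each instance keeps one label id across all slices:
```python
import numpy as np
import napari

volume = np.random.random((64, 256, 256))          # stand-in for your image stack
labels = np.zeros(volume.shape, dtype=np.uint16)   # empty 3D instance-label volume

viewer = napari.Viewer()
viewer.add_image(volume, name="raw")
viewer.add_labels(labels, name="instances")  # paint with the brush, one label id per instance
napari.run()
```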
[1] Grady, Leo. "Minimal surfaces extend shortest path segmentation methods to 3D." IEEE Transactions on Pattern Analysis and Machine Intelligence 32.2 (2008): 321-334.
[2] Falcão, Alexandre X., and Felipe PG Bergo. "Interactive volume segmentation with differential image foresting transforms." IEEE Transactions on Medical Imaging 23.9 (2004): 1100-1108.

You can use 3D Slicer's Segment Editor. It is free, open-source, has many built-in tools, and is customizable/extensible in Python or C++ (you can plug in your own segmentation method with minimal effort). To solve a segmentation task, you typically first figure out a good segmentation workflow (which tools to use, in what combination, and with what parameters) using the interactive GUI; then, if necessary, you can make it semi-automatic or fully automatic with Python scripting.
You could create a segmentation by contouring every image slice, but that would be too tedious. Instead, you can use 3D region growing (the Grow from seeds effect) or segment just a few slices and interpolate between them (the Fill between slices effect).
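For the scripting route, here is a rough sketch of the signed-distance interpolation commonly used for filling between slices (a generic SciPy illustration, not Slicer's actual implementation; it assumes both annotated slices contain the object):
```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance(mask):
    """Positive inside the object, negative outside."""
    mask = np.asarray(mask, dtype=bool)
    return distance_transform_edt(mask) - distance_transform_edt(~mask)

def fill_between(mask_a, mask_b, n_slices):
    """Interpolate n_slices binary masks between two annotated 2D slices."""
    da, db = signed_distance(mask_a), signed_distance(mask_b)
    steps = np.linspace(0, 1, n_slices + 2)[1:-1]   # exclude the two annotated end slices
    return np.stack([((1 - t) * da + t * db) > 0 for t in steps])
```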

Related

What are the ideal steps for using predicted segmentation masks in watershed post-processing?

I am experimenting with object segmentation (round-shaped objects that often occur close together). I have used the U-Net deep neural network architecture for segmentation and obtained segmentation masks, which I saved in .npy format.
I am a beginner in this area. I would like to know the ideal steps I should follow now if I want to apply watershed to the predicted masks with the aim of separating the objects.
I guess I need to convert the predicted binary mask into some form from which I can obtain markers indicating the centroids.
Please help
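(For reference, the typical marker-based pipeline looks something like the sketch below: compute a distance transform of the binary mask, take its local maxima as markers, and flood the negated distance map with watershed. The file name, min_distance and threshold are placeholders to tune.)
```python
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

mask = np.load("predicted_mask.npy") > 0.5        # binarise the U-Net output
distance = ndi.distance_transform_edt(mask)       # distance to the background

# Local maxima of the distance map give one marker per (roughly round) object.
coords = peak_local_max(distance, min_distance=10, labels=ndi.label(mask)[0])
markers = np.zeros(mask.shape, dtype=int)
markers[tuple(coords.T)] = np.arange(1, len(coords) + 1)

# Flood from the markers over the inverted distance map, restricted to the mask.
instance_labels = watershed(-distance, markers, mask=mask)
```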

Why does computer vision detection give bad results on small objects?

Currently, for detection (localisation + recognition tasks) in computer vision we mainly use deep learning algorithms. Two types of detector exist:
one stage: SSD, YOLO, RetinaNet, ...
two stage: R-CNN, Fast R-CNN and Faster R-CNN, for example
Using these detectors on very small objects (10 pixels, for example) is a very challenging task, and it seems the one-stage algorithms are worse than the two-stage ones. But I do not really understand why it works better with Faster R-CNN, for example. In fact, one-stage and two-stage detectors both use the anchor concept, and most of them use the same backbone, like VGG16 or ResNet-50/ResNet-101. That means the receptive field is the same. For example, I tried to detect very small objects with RetinaNet and with Faster R-CNN. With RetinaNet, small objects are not detected, contrary to Faster R-CNN. I do not understand why. What is the theoretical explanation? (same backbone: ResNet-50)
I think networks like RetinaNet are, in general, trying to bridge the gap you mention. In one-stage networks we usually have anchor boxes of varying scales on the feature maps produced by the backbone. These feature maps are produced by heavily downsampling the input image, and a lot of information about small objects can be lost in that operation. In two-stage detectors, on the other hand, the flexibility of the RPN means the network may still propose regions that are small, which may help it perform slightly better than its one-stage counterparts.
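To make the downsampling point concrete, a back-of-the-envelope check (the strides are the usual ResNet-50 + FPN defaults, assumed here rather than taken from the question):
```python
# How much of a single feature-map cell does a 10-px object cover at each FPN level?
strides = {"P3": 8, "P4": 16, "P5": 32, "P6": 64, "P7": 128}
object_size = 10  # pixels

for level, stride in strides.items():
    print(f"{level} (stride {stride}): a {object_size}px object spans "
          f"{object_size / stride:.2f} feature-map cells per side")
```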
I don't think you should be very surprised that both of these use the same backbone; after the convolutional features are extracted, the two networks use different methods to perform detection.
Hope this helps. Let me know if I wasn't clear enough or if you have questions.

Which is best for object localization among R-CNN, Fast R-CNN, Faster R-CNN and YOLO?

What is the difference between R-CNN, Fast R-CNN, Faster R-CNN and YOLO in terms of the following:
(1) Precision on the same image set
(2) Given the SAME IMAGE SIZE, the run time
(3) Support for Android porting
Considering these three criteria, which is the best object localization technique?
R-CNN is the daddy algorithm of all the algorithms mentioned; it really paved the way for researchers to build more complex and better algorithms on top of it.
R-CNN, or Region-based Convolutional Neural Network
R-CNN consists of 3 simple steps:
Scan the input image for possible objects using an algorithm called Selective Search, generating ~2000 region proposals
Run a convolutional neural net (CNN) on top of each of these region proposals
Take the output of each CNN and feed it into a) an SVM to classify the region and b) a linear regressor to tighten the bounding box of the object, if such an object exists.
Fast R-CNN:
Fast R-CNN immediately followed R-CNN. Fast R-CNN is faster and better by virtue of the following points:
Performing feature extraction over the image before proposing regions, thus running only one CNN over the entire image instead of 2000 CNNs over 2000 overlapping regions
Replacing the SVM with a softmax layer, thus extending the neural network for predictions instead of creating a new model
Intuitively it makes a lot of sense to drop the 2000 separate convolutional passes and instead run the convolution once and propose boxes on top of that shared feature map.
Faster R-CNN:
One of the drawbacks of Fast R-CNN was the slow Selective Search algorithm, so Faster R-CNN introduced something called the Region Proposal Network (RPN).
Here's how the RPN works:
At the last layer of an initial CNN, a 3x3 sliding window moves across the feature map and maps it to a lower dimension (e.g. 256-d)
For each sliding-window location, it generates multiple possible regions based on k fixed-ratio anchor boxes (default bounding boxes)
Each region proposal consists of:
an “objectness” score for that region and
4 coordinates representing the bounding box of the region
In other words, we look at each location in our last feature map and consider k different boxes centered around it: a tall box, a wide box, a large box, etc. For each of those boxes, we output whether or not we think it contains an object, and what the coordinates for that box are.
The 2k scores represent the softmax probability of each of the k bounding boxes containing an "object". Notice that although the RPN outputs bounding box coordinates, it does not try to classify any potential objects: its sole job is still proposing object regions. If an anchor box has an "objectness" score above a certain threshold, that box's coordinates get passed forward as a region proposal.
Once we have our region proposals, we feed them straight into what is essentially a Fast R-CNN. We add a pooling layer, some fully-connected layers, and finally a softmax classification layer and bounding box regressor. In a sense, Faster R-CNN = RPN + Fast R-CNN.
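As a concrete illustration of the anchor concept (not from the original post), here is a small sketch that generates the k boxes at one sliding-window location; the three scales and three aspect ratios below follow the commonly cited Faster R-CNN defaults, so k = 9:
```python
import numpy as np

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors as (x1, y1, x2, y2),
    centred at one sliding-window location (cx, cy) in image coordinates."""
    boxes = []
    for s in scales:
        for r in ratios:                    # r = height / width; area stays ~ s**2
            w, h = s / np.sqrt(r), s * np.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

print(anchors_at(320, 240).shape)           # (9, 4): 9 anchors per location
```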
YOLO:
YOLO uses a single CNN for both classification and localising objects with bounding boxes.
In the end you will have a tensor of shape 1470, i.e. 7*7*30, and the structure of the CNN output is as follows:
The 1470 vector output is divided into three parts, giving the probability, confidence and box coordinates. Each of these three parts is also further divided into 49 small regions, corresponding to the predictions at the 49 cells that form the original image.
In the post-processing step, we take this 1470-dimensional output from the network and generate the boxes whose probability is higher than a certain threshold.
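As a rough sketch of that post-processing (the flat layout below, class probabilities first, then confidences, then box coordinates, is the one used in the classic YOLOv1 ports and is assumed here):
```python
import numpy as np

S, B, C = 7, 2, 20                               # grid size, boxes per cell, classes
out = np.random.rand(S * S * (C + B * 5))        # stand-in for the 1470-d network output

class_probs = out[:S * S * C].reshape(S, S, C)                   # 980 values
confidences = out[S * S * C:S * S * (C + B)].reshape(S, S, B)    # 98 values
boxes = out[S * S * (C + B):].reshape(S, S, B, 4)                # 392 values

# Per-box class scores = confidence * class probability; threshold to keep detections.
scores = confidences[:, :, :, None] * class_probs[:, :, None, :]  # (S, S, B, C)
keep = scores.max(axis=-1) > 0.2                                   # boolean (S, S, B)
```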
I hope this gives you an understanding of these networks. To answer your question on how the performance of these networks differs:
On the same dataset: you can expect the performance of these networks to be in the order they were mentioned, with YOLO being the best and R-CNN being the worst.
Given the SAME IMAGE SIZE, the run time: Faster R-CNN achieved much better speeds and state-of-the-art accuracy. It is worth noting that although later models did a lot to increase detection speed, few models managed to outperform Faster R-CNN by a significant margin. Faster R-CNN may not be the simplest or fastest method for object detection, but it is still one of the best performing. However, researchers have used YOLO for video segmentation, and it is by far the best and fastest when it comes to video segmentation.
Support for Android porting: As far as my knowledge goes, TensorFlow has some Android APIs for porting to Android, but I am not sure how these networks will perform, or even whether you will be able to port them at all. That again depends on the hardware and the data size. Could you please provide the hardware and the size so that I can answer more clearly?
The YouTube video tagged by #A_Piro gives a nice explanation too.
P.S. I borrowed a lot of material from Joyce Xu's Medium blog.
If you are interested in these algorithms you should take a look at this lesson, which goes through the algorithms you named: https://www.youtube.com/watch?v=GxZrEKZfW2o.
PS: There is also a Fast YOLO, if I remember well haha!
I have been working with YOLO and FRCNN a lot. To me, YOLO has the best accuracy and speed, but if you want to do research on image processing I would suggest FRCNN, as much previous work has been done with it, and for research you really want to be consistent.
For object detection, I am trying SSD + MobileNet. It has a good balance of accuracy and speed, so it can also be ported to Android devices easily with good fps.
It has less accuracy compared to Faster R-CNN but more speed than other algorithms.
It also has good support for Android porting.

Surface mesh to volume mesh

I have a closed surface mesh generated with MeshLab from point clouds. I need to get a volume mesh from it so that it is not a hollow object, but I can't figure it out. I need to get an *.stl file for printing. Can anyone help me get a volume mesh? (I would prefer an easy solution rather than a complex algorithm.)
Given an oriented watertight surface mesh, an oracle function can be derived that determines whether a query line segment intersects the surface (and where): shoot a ray from one end-point and use the even-odd rule (after having spatially indexed the faces of the mesh).
Volumetric meshing algorithms can then be applied using this oracle function to tessellate the interior, typically variants of Marching Cubes or Delaunay-based approaches (see 3D Surface Mesh Generation in the CGAL documentation). The initial surface will however not be exactly preserved.
To my knowledge, MeshLab supports only surface meshes, so it is unlikely to provide a ready-to-use filter for this. Volume meshing packages, however, should offer this functionality (e.g. TetGen).
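A minimal sketch of that inside/outside oracle using the trimesh package (trimesh is my suggestion, not mentioned in the answer; the file name and grid resolution are placeholders):
```python
import numpy as np
import trimesh

mesh = trimesh.load("surface.stl")    # the closed surface mesh exported from MeshLab
assert mesh.is_watertight             # the even-odd test only makes sense for a closed surface

# Sample a regular grid over the bounding box and keep the interior points.
lo, hi = mesh.bounds
axes = [np.linspace(l, h, 40) for l, h in zip(lo, hi)]
grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)

inside = mesh.contains(grid)          # ray casting + even-odd rule under the hood
interior_points = grid[inside]
```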
The question is not perfectly clear, so I'll try to give a different interpretation. According to your last sentence:
I need to get an *.stl file for printing
it seems that you need a 3D model that is suitable for fabrication on a 3D printer, i.e. you need a watertight mesh. A watertight mesh is a mesh that defines the interior of a volume unambiguously; it corresponds to a mesh that is closed (no boundary), 2-manifold (mainly, each edge is shared by exactly two faces), and free of self-intersections.
MeshLab provides tools for visualizing boundaries, non-manifold edges and self-intersections. Correcting them is possible in many different ways (deleting the non-manifold elements and filling the holes, or drastic remeshing).
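If you prefer to check and repair this from a script rather than in the MeshLab GUI, a small sketch with the trimesh package (my suggestion, not part of the answer; the file names are placeholders):
```python
import trimesh

mesh = trimesh.load("model.stl")
print("watertight:", mesh.is_watertight)                  # closed, no boundary edges
print("winding consistent:", mesh.is_winding_consistent)  # consistently oriented faces

# A simple first repair attempt; serious defects still need manual remeshing.
trimesh.repair.fill_holes(mesh)
trimesh.repair.fix_normals(mesh)
mesh.export("model_fixed.stl")
```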

How can I create a classifier using the feature map of a CNN?

I intend to make a classifier using the feature map obtained from a CNN. Can someone suggest how I can do this?
Would it work if I first train the CNN using +ve and -ve samples (and hence obtain the weights), and then, every time I need to classify an image, apply the conv and pooling layers to obtain the feature map? The problem I see with this is that the image I want to classify may not have a similar feature map, so I wouldn't be able to compute the distance correctly, as the order of the features may be different in the layer.
You can use the same CNN for classification if you trained it with (for example) the cross-entropy loss (also known as softmax with loss). In that case, you should take the argmax of your last layer (the node with the highest score), and that is the class predicted by the network. However, all architectures used in machine learning expect, at test time, an input similar to those used during training.
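A minimal sketch of the argmax route with PyTorch/torchvision (the architecture and input shape are placeholders; the model is assumed to have already been trained on your +ve/-ve samples with a cross-entropy loss):
```python
import torch
from torchvision import models

model = models.resnet18(num_classes=2)      # 2 classes: positive / negative
# ...load the weights obtained from training on your +ve / -ve samples here...
model.eval()

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)     # stand-in for a preprocessed test image
    logits = model(image)                   # last layer: one score per class
    predicted = logits.argmax(dim=1)        # argmax of the last layer = predicted class
```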