I am trying to solve baby detection with a Unet segmentation model. I have already collected baby images and their segmentation masks, and I also passed in adult images as negatives (I created all-black masks for these).
If I train it this way, can the Unet model differentiate adults from babies? If not, what should I do next?
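In case it helps, this is how I build the targets: babies get their real mask, adults get an all-zero mask of the same size. A minimal NumPy sketch of my setup (the `make_target` helper and the `mask=None` convention for negatives are just my own illustration):

```python
import numpy as np

def make_target(mask, height=256, width=256):
    """Binary target mask; mask=None marks a negative (adult) example."""
    if mask is None:
        return np.zeros((height, width), dtype=np.float32)  # pure background
    return (mask > 0).astype(np.float32)  # binarize the baby mask
```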
It really depends on your dataset.
During training, Unet will try to learn specific features in the images, such as a baby's shape, body size, color, etc. If your dataset is good enough (e.g. it contains lots of baby examples, lots of adult examples that look visually distinct, and the image dimensions are not too high), then you probably won't have any problems at all.
There is a possibility, however, that your model misses some babies or adults in an image. To tackle this issue, there are a couple of things you can do:
Add data augmentation techniques during training (e.g. random crop, padding, brightness, contrast, etc.); see the sketch after this list.
You can make your model stronger by replacing the Unet with a newer approach, such as Unet++ or Unet3+. According to the Unet3+ paper, it outperforms both Unet and Unet++ on medical image segmentation tasks:
https://arxiv.org/ftp/arxiv/papers/2004/2004.08790.pdf
I have also found this repository, which contains a clean implementation of Unet3+ and might help you get started:
https://github.com/kochlisGit/Unet3-Plus
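For the augmentation point, here is a minimal sketch using the albumentations library (the specific transforms and parameter values are illustrative, not tuned for this task). Albumentations applies the same spatial transform to the image and its mask, which is exactly what a segmentation model needs:

```python
import numpy as np
import albumentations as A

# Spatial transforms (pad, crop, flip) are applied identically to image
# and mask; photometric ones (brightness/contrast) touch the image only.
train_transform = A.Compose([
    A.PadIfNeeded(min_height=256, min_width=256),
    A.RandomCrop(height=256, width=256),
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.5),
])

image = np.random.randint(0, 256, (300, 300, 3), dtype=np.uint8)  # dummy image
mask = np.random.randint(0, 2, (300, 300), dtype=np.uint8)        # dummy mask

augmented = train_transform(image=image, mask=mask)
aug_image, aug_mask = augmented["image"], augmented["mask"]
```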
I am implementing a keypoint detection algorithm to recognize biomedical landmarks on images. I only have one type of landmark to detect. But in a single image, 1-10 of these landmarks can be present. I'm wondering what's the best way to organize the ground truth to maximize learning.
I considered creating 10 landmark coordinates per image and associating them with flags that are either 0 (not present) or 1 (present). But this doesn't seem ideal: since the multiple landmarks in a single picture are actually the same type of biomedical element, the neural network shouldn't have to learn them as separate entities.
Any suggestions?
One landmark that can appear anywhere sounds like a typical CNN problem. Your CNN filters should learn which features make up the landmark, but they don't care where it appears; that is the responsibility of the subsequent layers. Hence, for training the CNN layers you can use a monochrome image as the target: 1 means "landmark at this pixel", 0 means no landmark (see the sketch below).
The next layers basically process the CNN-detected features. To train those, your ground truth should simply be the desired outcome. Do you just need a binary output (count > 0)? A somewhat accurate estimate of the count? Coordinates? Orientation? NNs don't care much what they learn, so give them in training exactly what they should produce at inference.
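If you go the heatmap route, a minimal sketch for rendering such a target follows. The Gaussian-blob choice and the sigma value are common conventions rather than requirements; a single hot pixel also works, but gives the loss a much sparser gradient:

```python
import numpy as np

def landmark_heatmap(coords, height, width, sigma=3.0):
    """Render a single-channel target: peaks at each landmark, 0 elsewhere.

    All landmarks of the same type share this one channel, so the network
    never has to treat them as separate entities.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    target = np.zeros((height, width), dtype=np.float32)
    for r, c in coords:
        blob = np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2 * sigma ** 2))
        target = np.maximum(target, blob)  # overlapping blobs keep the max
    return target

# Example: three landmarks of the same type in one 64x64 image.
gt = landmark_heatmap([(10, 12), (40, 50), (60, 20)], height=64, width=64)
```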
I want to retrain the object detector YOLOv4 to recognize figures of the board game Ticket to Ride.
While gathering pictures I was looking for a way to reduce the number of pictures needed.
I was wondering if more instances of an object/class in a picture means more "training per picture", which would mean "I need fewer pictures".
Is this correct? If not, could you try to explain in simple terms?
On the Roboflow page, they say that YOLOv4 breaks object detection into two pieces:
regression to identify object positioning via bounding boxes;
classification to classify the objects into classes.
Regression here means predicting continuous values, namely the coordinates of each bounding box; classification then assigns each detected box a class label ('train piece', 'tracks', 'station', or whatever else is worth separating from the rest).
Now, to answer your question: no, you need more pictures. More instances in a single picture do add training signal, but they share the same scene, lighting and viewpoint, so they add far less variety than separate pictures would. With more pictures, YOLOv4 has more samples to fit and test a more accurate classifier. Still, be careful about what you want to classify: you do want the algorithm to extract a 'train' class from an image, but not an 'ocean' class, for example. To prevent this, take more (different) pictures of the classes you want to have!
I'm working on a model to identify bodies of water in satellite imagery. I've modified this example a bit to use a set of ~600 images I've labeled, and it works quite well for true positives: it produces an accurate mask for imagery tiles that contain water. However, it produces some false positives as well, generating masks for tiles with no water in them whatsoever (tiles containing fields, buildings, or parking lots, for instance). I'm not sure how to provide this sort of negative feedback to the model. Adding false-positive images to the training set with an empty mask has no effect, and a training set made up of only false positives just produces random noise, which makes me think that empty masks have no effect on this particular network.
I also tried training a binary classification network, from a couple of examples I found, to classify tiles as water / not water. It doesn't seem to reach good enough accuracy to use as a first-pass filter, with about 5k images per class. I used OSM label-maker for this, and the image sets aren't perfect (there are some water tiles in the non-water set and vice versa), but even on the training set the accuracy isn't good (~0.85 at best).
Is there a way to provide negative feedback to a binary image segmentation model? Should I use a larger training set? I'm kind of stuck here without the ability to provide negative feedback, and would appreciate any pointers on how to handle this.
Thanks!
What are common techniques for finding which parts of images contribute most to image classification via convolutional neural nets?
In general, suppose we have 2D matrices with float values between 0 and 1 as entries. Each matrix is associated with a label (single-label, multi-class), and the goal is to perform classification via (Keras) 2D CNNs.
I'm trying to find methods to extract relevant subsequences of rows/columns that contribute most to classification.
Two examples:
https://github.com/jacobgil/keras-cam
https://github.com/tdeboissiere/VGG16CAM-keras
Other examples/resources with an eye toward Keras would be much appreciated.
Note that my datasets are not actual images, so methods relying on ImageDataGenerator might not directly apply in this case.
There are many visualization methods. Each of these methods has its strengths and weaknesses.
However, you have to keep in mind that the methods partly visualize different things. Here is a short overview based on this paper.
You can distinguish between three main visualization groups:
Functions (gradients, saliency maps): these methods visualize how a change in input space affects the prediction.
Signal (deconvolution, Guided BackProp, PatternNet): these methods visualize the signal, i.e. the pattern that caused a particular neuron to activate.
Attribution (LRP, Deep Taylor Decomposition, PatternAttribution): these methods visualize how much each pixel contributed to the prediction. The result is a heatmap highlighting which pixels of the input image contributed most strongly to the classification.
Since you are asking how much a pixel contributed to the classification, you should use attribution methods. Nevertheless, the other methods have their uses as well.
One nice toolbox for visualizing heatmaps is iNNvestigate.
This toolbox contains the following methods:
SmoothGrad
DeConvNet
Guided BackProp
PatternNet
PatternAttribution
Occlusion
Input times Gradient
Integrated Gradients
Deep Taylor
LRP
DeepLift
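To make the list concrete, here is a minimal usage sketch in the style of the iNNvestigate README. The tiny model and dummy input are stand-ins, and module paths may differ slightly between iNNvestigate versions, so treat this as a template rather than a definitive recipe:

```python
import numpy as np
import keras
import innvestigate
import innvestigate.utils

# Tiny stand-in classifier; replace with your trained Keras model.
model = keras.models.Sequential([
    keras.layers.Conv2D(8, 3, activation="relu", input_shape=(28, 28, 1)),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])

# Attribution methods are applied to the pre-softmax scores,
# so the softmax layer is stripped first.
model_wo_softmax = innvestigate.utils.model_wo_softmax(model)

# "lrp.z" is one LRP variant; any method name from the list above works.
analyzer = innvestigate.create_analyzer("lrp.z", model_wo_softmax)

x = np.random.rand(1, 28, 28, 1).astype("float32")  # dummy input batch
heatmap = analyzer.analyze(x)  # same shape as x: per-pixel relevance scores
```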
If I understand the game of Go correctly, there is a board of 19x19 points. The AlphaGo Nature paper, http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html, mentions a convolutional network. My understanding of convolutional networks comes from examples in image recognition. How can a convolutional network be applied to this problem? Isn't it overkill to transform the board into a 19x19 image?
Go is influenced a lot by patterns, and as you might have noticed in image classification, convolutional networks are good at those.
You ask whether it is overkill to turn a Go board into a 19x19 image. I have to admit I have not tried creating an image of it (with, say, 0 for a black stone, 0.5 for no stone, and 1 for a white stone) and training a network on it, but I am pretty sure it would work to some extent.
Things are actually more extreme than that! The 19x19 Go board is converted into a 19x19x48 input tensor (as an RGB image it would only be 19x19x3):
one plane for black stones
one plane for white stones
one plane for empty places
and 45 other planes encoding several values that are helpful for the network to know (things like liberties, atari, liberties after the move; they are all in the paper, but you have to know a bit more about Go to understand them).
Is this overkill? Definitely not! Convolutional networks are good at recognizing patterns, but they need the right information to do so. For example, a ladder is impossible for this network to detect on its own, because the information cannot travel from one side of the board to the other and back within the 13 convolutional layers used; so some of the 48 input planes are used to tell the network whether a certain move is a ladder capture or a ladder escape move. A minimal sketch of this plane stacking follows below.
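Here is what stacking board features into input planes looks like in code. Only three of the 48 planes are shown, and the integer board encoding is my own illustrative choice, not the paper's:

```python
import numpy as np

SIZE = 19
EMPTY, BLACK, WHITE = 0, 1, 2

board = np.zeros((SIZE, SIZE), dtype=np.int8)  # dummy position
board[3, 3], board[15, 15] = BLACK, WHITE

planes = np.stack([
    (board == BLACK).astype(np.float32),  # plane: black stones
    (board == WHITE).astype(np.float32),  # plane: white stones
    (board == EMPTY).astype(np.float32),  # plane: empty points
    # ...45 more planes (liberties, ladder status, etc.) in the real input
], axis=-1)  # shape (19, 19, 3) here; (19, 19, 48) in AlphaGo

print(planes.shape)
```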