The skip connections let the gradient flow all the way from the 152nd layer back to the first or second layer of the CNN. But what about the middle layers? If backpropagation through these middle layers is irrelevant, are we even learning anything in ResNet?
Backpropagation through these middle layers isn't irrelevant at all. The basic evidence for their relevance is that ResNet keeps improving its error rate as layers are added (from 5.71% top-5 error with 34 layers to 4.49% with 152). Images have a lot of singularities and complexities, and the folks at Microsoft found that once you take care of the vanishing gradient problem (with the skip connections), more layers let the network gain more knowledge.
The idea of adding residual blocks is to prevent the vanishing gradient problem when you stack very many layers. But the middle layers are still updated on every training step, and they are still learning (usually high-level features).
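To make the mechanism concrete, here is a minimal sketch of a basic residual block, assuming PyTorch and a simplified setting (same number of channels in and out, no downsampling branch):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic two-convolution residual block (simplified sketch)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The skip connection: gradients flow through this addition
        # unattenuated, so the middle layers of a very deep stack
        # still receive a useful training signal.
        return torch.relu(out + x)
```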
Convolutional neural networks with that many layers tend to overfit if the problem isn't complex enough, since 152 layers have the capacity to learn a great many different patterns.
I am implementing a keypoint detection algorithm to recognize biomedical landmarks on images. I only have one type of landmark to detect. But in a single image, 1-10 of these landmarks can be present. I'm wondering what's the best way to organize the ground truth to maximize learning.
I considered creating 10 landmark coordinates per image and associating them with flags that are either 0 (not present) or 1 (present). But this doesn't seem ideal: since the multiple landmarks in a single picture are the same type of biomedical element, the neural network shouldn't have to learn them as separate entities.
Any suggestions?
One landmark that can appear anywhere sounds like a typical CNN problem. Your CNN filters should learn which features make up the landmark, but they don't care where it appears; that is the responsibility of the later layers. Hence, for training the CNN layers you can use a monochrome image as the target: 1 means "landmark at this pixel", 0 means no landmark.
The later layers basically process the CNN-detected features. To train those, your ground truth should simply be the desired outcome. Do you just need a binary output (count > 0)? A reasonably accurate estimate of the count? Coordinates? Orientation? NNs don't much care what they learn, so give them in training exactly what they should produce at inference.
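As a rough illustration of the monochrome-target idea above, here is a sketch assuming numpy; the function name, its arguments, and the Gaussian blob around each landmark (a common practical tweak that gives the loss a smoother gradient than a single hot pixel) are my choices, not something from the question:

```python
import numpy as np

def make_heatmap(landmarks, height, width, sigma=2.0):
    # landmarks: list of (x, y) pixel coordinates; any count works,
    # so 1-10 landmarks of the same type need no per-slot flags.
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for x, y in landmarks:
        blob = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, blob)  # overlapping blobs merge cleanly
    return heatmap

# Hypothetical usage: three landmarks on a 256x256 tile
target = make_heatmap([(40, 50), (120, 80), (200, 210)], 256, 256)
```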
I am training a model with Mask R-CNN.
I have images with two masks where one is completely inside the other. Does this affect the model in any negative way? I have read in the literature that Mask R-CNN handles overlapping masks (high IoU) well.
After YOLOv1 there was a trend in later iterations of using anchor boxes as priors for a while (I believe the reason was both to speed up training and to detect objects of different sizes better).
However, YOLOv1 has an interesting mechanism in which k bounding box predictors slide over each grid cell, allowing them to specialize in detecting objects at different scales.
Here is what I wonder, ladies and gentlemen:
Given a very long training time, can these bounding box predictors in YOLOv1 achieve better bounding boxes than YOLO9000 or its counterparts that rely on the anchor box mechanism?
In my experience, yes, they can. I observed two possible optimization paths, one of which is already implemented in the latest versions of YOLOv3 and YOLOv5 by Ultralytics (https://github.com/ultralytics/yolov5).
What I observed was that for YOLOv3, even before training, you can use k-means clustering to ascertain a number of "common box" shapes. When fed into the network as anchor masks, these really improved the performance of the YOLOv3 network for that particular dataset, since the non-max suppression routine had a much better chance of filtering out spurious detections for particular classes in each detection head. To the best of my knowledge, this technique was implemented in the latest iterations of their bounding box regression code.
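As a rough sketch of that k-means step, assuming numpy; the function names and the use of (1 - IoU) as the distance measure are illustrative choices, not Ultralytics' exact code:

```python
import numpy as np

def iou_wh(boxes, clusters):
    # IoU between (w, h) pairs, treating all boxes as sharing a corner.
    w = np.minimum(boxes[:, None, 0], clusters[None, :, 0])
    h = np.minimum(boxes[:, None, 1], clusters[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
          + (clusters[:, 0] * clusters[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    # boxes: (N, 2) array of ground-truth box widths and heights.
    rng = np.random.default_rng(seed)
    clusters = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the cluster with the highest IoU,
        # i.e. the smallest (1 - IoU) distance.
        nearest = np.argmax(iou_wh(boxes, clusters), axis=1)
        new = np.array([boxes[nearest == i].mean(axis=0)
                        if np.any(nearest == i) else clusters[i]
                        for i in range(k)])
        if np.allclose(new, clusters):
            break
        clusters = new
    return clusters[np.argsort(clusters.prod(axis=1))]  # sorted by area
```

The resulting k (width, height) pairs can then be distributed across the detection heads, smallest anchors going to the head that sees the finest feature map.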
Suppressing certain layers. In YOLOv3, the network performs detection in three stages, the idea being to progressively detect objects from larger to smaller. YOLOv3 (and, in theory, v1) can benefit if, with some trial and error, you can ascertain which detection head your network prefers to use, based on the common bounding box shapes you found in step 1.
I'm working on a model to identify bodies of water in satellite imagery. I've modified this example a bit to use a set of ~600 images I've labeled, and it's working quite well for true positives: it produces an accurate mask for imagery tiles with water in them. However, it produces some false positives as well, generating masks for tiles that have no water in them whatsoever (tiles containing fields, buildings, or parking lots, for instance). I'm not sure how to provide this sort of negative feedback to the model. Adding false-positive images to the training set with an empty mask has no effect, and a training set made up of only false positives just produces random noise, which makes me think that empty masks have no effect on this particular network.
I also tried training a binary classification network, based on a couple of examples I found, to classify tiles as water/not-water. It doesn't seem to achieve good enough accuracy to serve as a first-pass filter, with about 5k images per class. I used OSM label-maker for this, and the image sets aren't perfect (there are some water tiles in the non-water set and vice versa), but even the training set isn't reaching good accuracy (~0.85 at best).
Is there a way to provide negative feedback to the binary image segmentation model? Should I use a larger training set? I'm kinda stuck here without the ability to provide negative feedback, and would appreciate any pointers on how to handle this.
Thanks!
If I understand the game of Go correctly, there is a board of 19x19. The AlphaGo Nature paper (http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html) mentions a convolutional network. My understanding of convolutional networks comes from examples in image recognition. How, then, could a convolutional network be applied to this problem? Isn't it overkill to transform the board into a 19x19 image?
Go is influenced a lot by patterns, and as you might have noticed in image classification, convolutional networks are good at those.
You ask if it is overkill to change a Go board into a 19x19 image. I have to admit I have not tried to create an image of it with, say, 0 for a black stone, 0.5 for no stone, and 1 for a white stone, and train a network with it, but I am pretty sure it would work to some extent.
Things are more extreme than that! The 19x19 Go board is converted into a 19x19x48 input tensor (as an RGB image it would only be 19x19x3):
one plane for black stones
one plane for white stones
one plane for empty places
and 45 other planes encoding several values that are helpful for the network to know (things like liberties, atari, liberties after a move; they are all in the paper, but you have to know a little more about Go to understand them).
Is this overkill? Definitely not! Convolutional networks are good at recognizing patterns, but they need the right information to do so. For example, a ladder is impossible for this network to detect on its own, since that information cannot travel from one side of the board to the other and back within the 13 convolutional layers used, so some of the 48 input planes are used to tell the network whether a certain move is a ladder capture or a ladder escape.
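To make the plane encoding above concrete, here is a minimal sketch assuming numpy; the board convention (1 = black, -1 = white, 0 = empty) is hypothetical, and only the first three of the 48 planes are shown:

```python
import numpy as np

def encode_board(board):
    # board: 19x19 int array with 1 = black stone, -1 = white stone,
    # 0 = empty point (a hypothetical convention for this sketch).
    planes = np.zeros((3, 19, 19), dtype=np.float32)
    planes[0] = (board == 1)   # plane for black stones
    planes[1] = (board == -1)  # plane for white stones
    planes[2] = (board == 0)   # plane for empty points
    # AlphaGo's remaining 45 planes (liberties, ladder flags, etc.)
    # would be stacked on in the same way.
    return planes
```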