I need help trying to understand how the math of a deconv layer works. Let's talk about this layer:
layer {
  name: "decon"
  type: "Deconvolution"
  bottom: "conv2"
  top: "decon"
  convolution_param {
    num_output: 1
    kernel_size: 4
    stride: 2
    pad: 1
  }
}
So basically this layer is supposed to "upscale" an image by a factor of 2. If I look at the learned weights, I see e.g. this:
-0.0629104823 -0.1560362280 -0.1512266700 -0.0636162385
-0.0635886043 +0.2607241870 +0.2634004350 -0.0603787377
-0.0718072355 +0.3858278100 +0.3168329000 -0.0817491412
-0.0811873227 -0.0312164668 -0.0321144797 -0.0388795212
So far, so good. Now I'm trying to understand how to apply these weights to actually achieve the upscaling effect. I need to do this in my own code because I want to use simple pixel shaders.
Looking at the Caffe code, "DeconvolutionLayer::Forward_cpu" internally calls "backward_cpu_gemm", which does "gemm" followed by "col2im". My understanding of how all this works: gemm takes the input image and multiplies each pixel by each of the 16 weights listed above, so it basically produces 16 output "images". Then col2im sums up these 16 "images" to produce the final output image. But due to the stride of 2, it spreads the 16 gemm images over the output image in such a way that each output pixel is composed of only 4 gemm pixels. Does that sound correct to you so far?
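To make my mental model concrete, here is a tiny NumPy sketch (my own toy code, not Caffe's actual implementation) of what that gemm + col2im combination boils down to for one input and one output channel with kernel 4, stride 2, pad 1: every input pixel is multiplied by the whole 4x4 kernel and scattered into the output at a stride-2 offset, and the padding is cropped off at the end.

import numpy as np

def deconv2d(x, k, stride=2, pad=1):
    # Scatter each input pixel, scaled by the kernel, onto a padded canvas.
    in_h, in_w = x.shape
    ks = k.shape[0]
    out_h = stride * (in_h - 1) + ks - 2 * pad
    out_w = stride * (in_w - 1) + ks - 2 * pad
    canvas = np.zeros((out_h + 2 * pad, out_w + 2 * pad))
    for i in range(in_h):
        for j in range(in_w):
            canvas[i * stride:i * stride + ks, j * stride:j * stride + ks] += x[i, j] * k
    # Crop the padding away to get the final output.
    return canvas[pad:pad + out_h, pad:pad + out_w]

x = np.random.rand(8, 8)      # low-res input (e.g. one channel of "conv2")
k = np.random.rand(4, 4)      # one learned 4x4 deconv kernel
print(deconv2d(x, k).shape)   # (16, 16): upscaled by a factor of 2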
My understanding is that each output pixel is calculated from the nearest 4 low-res pixels, using 4 weights from the 4x4 deconv weight matrix. If you look at the following image:
https://i.stack.imgur.com/X6iXE.png
Each output pixel uses either the yellow, pink, grey or white weights, but not the other weights. Do I understand that correctly? If so, I have a huge understanding problem, because in order for this whole concept to work correctly, e.g. the yellow weights should add up to the same sum as the pink weights etc. But they do not! As a result my pixel shader produces images where 1 out of 4 pixels is darker than the others, or every other line is darker, or things like that (depending on which trained model I'm using). Obviously, when running the model through Caffe, no such artifacts occur. So I must have a misunderstanding somewhere. But I can't find it... :-(
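A quick way to check that picture (again my own snippet, border effects ignored): with kernel 4, stride 2, pad 1, the padded output position c = o + 1 receives a contribution from input pixel i through kernel tap k exactly when c = 2*i + k, so along each axis only two of the four taps can ever hit a given output pixel, alternating with the pixel's parity. In 2D that gives exactly the 2x2 yellow/pink/grey/white grouping.

kernel, stride, pad = 4, 2, 1
for o in range(4):                     # a few output positions along one axis
    c = o + pad                        # position on the padded output canvas
    taps = [k for k in range(kernel) if (c - k) % stride == 0]
    print(o, taps)                     # 0 [1, 3], 1 [0, 2], 2 [1, 3], 3 [0, 2]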
P.S: Just to complete the information: There's a conv layer in front of the deconv layer with "num_output" of e.g. 64. So the deconv layer actually has e.g. 64 4x4 weights, plus one bias, of course.
After a lot of debugging I found that my understanding of the deconv layer was perfectly alright. I fixed all the artifacts by simply dividing the bias floats by 255.0. That's necessary because pixel shaders run in the 0-1 range, while the Caffe bias constants seem to be targeted at 0-255 pixel values.
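In other words, the only change needed in the shader is rescaling that one constant; a minimal illustration (the numbers are made up):

bias_255 = 12.7                           # hypothetical bias read from the .caffemodel
weighted_sum = 0.43                       # hypothetical 4-tap weighted sum, inputs in 0..1
pixel = weighted_sum + bias_255 / 255.0   # bias rescaled to the 0..1 shader range
print(pixel)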
Everything working great now.
I still don't understand why the 4 weight pairs don't sum up to the same value and how that can possibly work. But what do I know. It does work, after all. I suppose some things will always be a mystery to me.
I'm currently working on an object detection project using Matterport MaskRCNN.
Part of the job is to detect a green leaf that crosses a white grid. Until now I have defined the annotations (polygons) in such a way that every single leaf which crosses the net (and gives a white-green-white pattern) is considered a valid annotation.
But when changing the definition above from single-cross annotation to multi-cross (more than one leaf crossing the net at once), I started to see a serious decrease in model performance during the testing phase.
This raised my question. The only difference between the two comes down to the size of the annotation. So:
Which of the following is more influential on learning during MaskRCNN's training - pattern or size?
If the pattern is what matters, that's better, because the goal is to identify a crossing. Conversely, if the size of the annotation is the main influence, that's a problem, because I don't want the model to look for a multi-cross (or, alternatively, a large single-cross) in the image.
P.S. - References to recommended articles that explain the subject will be welcomed
Thanks in advance
If I understand correctly, the shape of the annotation becomes longer and more stretched out when going for multi-cross annotation.
In that case you can change the size and side ratio of the anchors that scan the image for objects. With the default settings the model often produces squarish bounding boxes, which means that very long and narrow annotations end up in bounding boxes with a large difference between width and height. These objects seem to be harder for the model to detect and segment.
These are the default configurations in the config.py file:
# Length of square anchor side in pixels
RPN_ANCHOR_SCALES = (32, 64, 128, 256, 512)
# Ratios of anchors at each cell (width/height). A value of 1 represents a
# square anchor, and 0.5 is a wide anchor
RPN_ANCHOR_RATIOS = [0.5, 1, 2]
You can play around with these values in inference mode and see if that gives you better results.
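For example, a minimal sketch of how such an override could look (assuming the standard Matterport Mask R-CNN Config class; the class name and concrete values are just illustrative, and the number of scales and ratios has to stay the same as at training time):

from mrcnn.config import Config

class LeafInferenceConfig(Config):       # hypothetical config for this project
    NAME = "leaf"
    NUM_CLASSES = 1 + 1                  # background + leaf crossing
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
    # Smaller scales and more extreme width/height ratios so the RPN also
    # proposes long, narrow boxes instead of mostly squarish ones.
    RPN_ANCHOR_SCALES = (16, 32, 64, 128, 256)
    RPN_ANCHOR_RATIOS = [0.25, 1, 4]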
I was reading the Fast R-CNN Caffe code. Inside the SmoothL1LossLayer, I found that the implementation is not the same as the paper's equation. Is that how it should be?
The paper equation:
For each labeled bounding box with class u, we calculate the summed error over tx, ty, tw, th, but in the code we have:
There is no class label information used. Can anyone explain why?
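For reference, the localization term in the paper's multi-task loss, as I read it, is

L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\left(t_i^u - v_i\right),
\qquad
\mathrm{smooth}_{L_1}(x) =
\begin{cases}
0.5\,x^2 & \text{if } |x| < 1 \\
|x| - 0.5 & \text{otherwise,}
\end{cases}

and it is gated by the indicator [u >= 1], so it only uses the box regression of the ground-truth class u and is switched off for background.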
And in the backpropagation step,
why is there an i here?
In train.prototxt, bbox_pred has an output size of 84 = 4 (x, y, h, w) * 21 (number of labels). So does bbox_targets. So it is using all labels.
As for loss layers, it is looping over the bottom blobs to find which one to propagate the gradient through. Here only one of the propagate_down[i] entries is true.
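To make the layout concrete, here is a small sketch (mine, not from the repository) of how the 84-wide bbox_pred / bbox_targets blobs encode one (dx, dy, dw, dh) per class; the targets are only filled in (and, as far as I remember, masked via per-entry weights) for the ground-truth class, which is how the label enters the loss without an explicit label input:

import numpy as np

num_classes = 21                                  # 20 VOC classes + background
bbox_pred = np.random.randn(1, 4 * num_classes)   # one RoI, shape (1, 84)

label = 7                                         # hypothetical ground-truth class of this RoI
cols = slice(4 * label, 4 * label + 4)
print(bbox_pred[0, cols])                         # the (dx, dy, dw, dh) used for this RoI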
I will be thankful if you answer my question. I am worried I am doing something wrong, because my network always gives a black image without any segmentation.
I am doing semantic segmentation in Caffe. The output of the score layer is <1 5 256 256> (batch_size, no_classes, image_width, image_height), which is sent to the SoftmaxWithLoss layer; the other input of the loss layer is the ground-truth image with 5 class labels, <1 1 256 256>.
My question is: the dimensions of these two inputs of the loss layer do not match. Should I create 5 label images for these 5 classes and send a batch_size of 5 from the label layer into the loss layer?
How can I prepare label data for semantic segmentation?
Regards
Your dimensions are okay. You are outputting a 5-vector per pixel indicating the probability of each class. The ground truth is a single label (an integer) per pixel, and the loss encourages the probability of the correct label to be maximal for that pixel.
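A minimal sketch of what the two blobs look like (shapes and values here are just illustrative):

import numpy as np

num_classes = 5
label = np.zeros((1, 1, 256, 256), dtype=np.uint8)   # one integer label per pixel, values 0..4
label[0, 0, 100:150, 60:200] = 3                     # e.g. a region belonging to class 3

score = np.random.randn(1, num_classes, 256, 256)    # what the score layer outputs
# SoftmaxWithLoss compares the 5 scores score[0, :, y, x] (softmaxed into probabilities)
# against the single integer label[0, 0, y, x] at every pixel; no one-hot label images needed.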
I have some ActionScript3 code I'm using to create liquid-like "droplets", and when they're first generated they look like a curved square (that's as close as I can get them to being a circle). I've tried and failed a lot here but my goal is to make these droplets look more organic and free-form, as if you were looking closely at rain drops on your windshield before they start dripping.
Here's what I have:
var size:int = (100 - asset.width) / 4,
    droplet:Shape = new Shape();
droplet.graphics.beginFill(0xCC0000);
droplet.graphics.moveTo(size / 2, 0);                  // start at the top midpoint
droplet.graphics.curveTo(size, 0, size, size / 2);     // top-right quarter (control point at the corner)
droplet.graphics.curveTo(size, size, size / 2, size);  // bottom-right quarter
droplet.graphics.curveTo(0, size, 0, size / 2);        // bottom-left quarter
droplet.graphics.curveTo(0, 0, size / 2, 0);           // top-left quarter, back to the start
// Apply some bevel filters and such...
Which yields a droplet shaped like this:
When I try adding some randomness to the size or the integers or add more curves in the code above, I end up getting jagged points and some line overlap/inversion.
I'm really hoping someone who is good at math or bezier logic can see something obvious that I need to do to make my consistently rounded-corner square achieve shape randomness similar to this:
First off, you can get actual circle-looking circles with Béziers by using 0.55228 * size rather than half the size for the control-point offset (in relation to Bézier curves this constant is sometimes called kappa). It only applies if you're using four segments, and that's where the other hint comes in: the more points you have, the more you can make your shape "creep", so you might actually want more segments. In that case it becomes easier to simply generate a number of points on a circle (fairly straightforward using good old sine and cosine functions and a regularly spaced angle), and then build the multi-segment Catmull-Rom curve through those points instead. Catmull-Rom curves and Bézier curves are different representations of the same curvatures, so you can pretty much trivially convert from one to the other, as explained at http://pomax.github.io/bezierinfo/#catmullconv (the last item in the section gives the translation if you don't care about the maths). You can then introduce as much random travel as you want (make the upper points a little stickier and "jerk" them down when they get too far from the bottom points to get that sticky rain look).
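Here is a rough sketch of that approach in Python (easy to port to AS3; the function names and the jitter amount are just mine): jittered points on a circle, then each uniform Catmull-Rom segment converted to an equivalent cubic Bézier whose control points you can hand to the drawing API.

import math, random

def droplet_points(cx, cy, radius, n=8, jitter=0.25):
    # n points on a circle, each pushed in or out by a random amount.
    pts = []
    for i in range(n):
        a = 2 * math.pi * i / n
        r = radius * (1 + random.uniform(-jitter, jitter))
        pts.append((cx + r * math.cos(a), cy + r * math.sin(a)))
    return pts

def catmull_rom_to_bezier(p0, p1, p2, p3):
    # Cubic Bezier from p1 to p2 equivalent to the uniform Catmull-Rom segment.
    c1 = (p1[0] + (p2[0] - p0[0]) / 6.0, p1[1] + (p2[1] - p0[1]) / 6.0)
    c2 = (p2[0] - (p3[0] - p1[0]) / 6.0, p2[1] - (p3[1] - p1[1]) / 6.0)
    return p1, c1, c2, p2

pts = droplet_points(50, 50, 40)
n = len(pts)
for i in range(n):   # closed shape: wrap the point indices around
    print(catmull_rom_to_bezier(pts[i - 1], pts[i], pts[(i + 1) % n], pts[(i + 2) % n]))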
How can I set the text size (inside a TextField) in standard CSS/printable points? According to the manual:
fontSize - Only the numeric part of the value is used. Units (px, pt)
are not parsed; pixels and points are equivalent.
As far as I understand, 1 pixel may be equal to 1 point only in the 72 PPI case. So ActionScript just operates on pixels (not real points). My trouble is getting an actual text size that I can print. Any advice or solutions are welcome.
SWF is measured in pixels and, moreover, is scalable, so 1 pixel can be 1 point now, 2 points a bit later (scaleY = scaleX = 2), and an undefined number another bit later (removed from the stage without dereferencing). In short, for AS there are NO "real points", since it knows about displays but does not know a thing about printers.
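The underlying conversion itself is simple if you can get hold of the real device or printer DPI (on screen the closest AS3 gives you is Capabilities.screenDPI, which is not always accurate); a quick sketch:

def points_to_pixels(pt, dpi):
    return pt * dpi / 72.0             # 1 typographic point = 1/72 inch

print(points_to_pixels(12, 72))        # 12.0 -> pixels equal points only at 72 DPI
print(points_to_pixels(12, 96))        # 16.0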