How to use spatial transformer to crop the image in pytorch? - deep-learning

The Spatial Transformer Network paper claims that it can be used to crop an image.
Given the crop region (top_left, bottom_right) = (x1, y1, x2, y2), how do I interpret the region as a transformation matrix and crop the image in PyTorch?
Here is an introduction to spatial transformer networks in Torch (http://torch.ch/blog/2015/09/07/spatial_transformers.html). The introduction visualizes the bounding box that the transformer looks at. How can we determine the bounding box given the transformation matrix?
[Edit]
I just found out the answer to the first question [given the crop region, find the transformation matrix].
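Roughly, the crop box maps to the theta passed to affine_grid like this (a minimal sketch of my understanding, assuming the box is given in pixel coordinates and using the align_corners=True normalization convention; box_to_theta is just an illustrative helper):

import torch
import torch.nn.functional as F

def box_to_theta(x1, y1, x2, y2, W, H):
    'Pixel-space crop box -> (1,2,3) theta for F.affine_grid (align_corners=True convention)'
    # normalize the corners to [-1, 1]
    nx1, nx2 = 2 * x1 / (W - 1) - 1, 2 * x2 / (W - 1) - 1
    ny1, ny2 = 2 * y1 / (H - 1) - 1, 2 * y2 / (H - 1) - 1
    # scale on the diagonal, crop center in the last column
    return torch.tensor([[[(nx2 - nx1) / 2, 0.0, (nx1 + nx2) / 2],
                          [0.0, (ny2 - ny1) / 2, (ny1 + ny2) / 2]]])

img = torch.rand(1, 3, 224, 224)                  # dummy image
theta = box_to_theta(50, 60, 150, 180, 224, 224)  # (x1,y1)=(50,60), (x2,y2)=(150,180)
grid = F.affine_grid(theta, (1, 3, 120, 100), align_corners=True)  # output H = y2-y1, W = x2-x1
crop = F.grid_sample(img, grid, align_corners=True)

Here the output size matches the box size, so the crop is not rescaled; a different output size would resample the same region.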

The image in the original post already provides a good answer, but it might be useful to provide some code.
Importantly, this method should retain gradients correctly. In my case I have a batch of (y, x) values that represent the center of the crop position (in the range [-1, 1]). The values a and b are the x and y scales of the transformation; I used 0.5 for each, combined with a smaller output size (half the width and height), to keep the original scale, i.e. to crop. You can use 1 to have no scale change, but then there is no cropping.
import torch
import torch.nn.functional as F

def crop_to_affine_matrix(t, a=0.5, b=0.5):
    'Turns (N,2) translate values into (N,2,3) affine transformation matrices'
    t = t.reshape(-1, 1, 2, 1).flip(2)     # flip (y, x) order to (x, y)
    t = F.pad(t, (2, 0, 0, 0)).squeeze(1)  # pad to (N,2,3); translations end up in the last column
    t[:, 0, 0] = a                         # x scale
    t[:, 1, 1] = b                         # y scale
    return t

t = torch.zeros(5, 2)        # center crop positions (y, x) for batch size 5
outsize = (5, 3, 112, 112)   # (N, C, H_out, W_out); half-size output to match a = b = 0.5
grid = F.affine_grid(crop_to_affine_matrix(t), outsize, align_corners=False)
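To actually produce the crops you would then sample with that grid; a minimal continuation (the input batch image is a placeholder I made up to match outsize):

image = torch.rand(5, 3, 224, 224)                         # placeholder input batch
cropped = F.grid_sample(image, grid, align_corners=False)  # -> shape (5, 3, 112, 112)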

Related

How can I convert dataset annotations to the fixed (YOLOv5) format without hand encoding?

I am working on an object detection project where the primary task is to identify brand logos. After doing some research I found this dataset available for
brand logos (for more about the dataset: here).
DATASET:
This dataset has two versions:
FlickrLogos32
FlickrLogos47 (recommended for brand detection)
The 32 and 47 in the names are the number of classes offered by each version. The documentation itself mentions that the 47 version is correctly annotated and recommended for object detection & recognition, and it is the version I use in my project.
Model:
I am using YOLOv5 for object detection. The
reason for using YOLOv5 rather than earlier versions is that it is well documented, with a couple of tutorials and Jupyter notebooks available.
Problem:
For the YOLOv5 object detection model, each object label should be annotated as
<x_center> <y_center> <width> <height> for its bounding box (see image below),
whereas the dataset annotations are given in the form
<x1> <y1> <x2> <y2>, where <x1>,<y1> is the upper-left corner of the bounding box
and <x2>,<y2> is the lower-right corner.
How can I transform the corner points <x1>,<y1>,<x2>,<y2> of a bounding box into the native YOLO
annotation format, i.e. <center_x>,<center_y>,<width>,<height>,
without going over the images one by one and drawing rectangles by hand in Roboflow?
Also, the labels are given in pixels, so they have to be normalized to (0, 1).
Dataset Insights:
Every dataset example consists of an image (.png) and, as its label, a ground-truth file (.txt) (see image below).
The '.mask' file is just a binary mask of the object present in the image.
So a data example looks like:
Image:
gt_data.txt:
Mask:
In general, the center should be calculated as xmin + (width/2) and ymin + (height/2), so I think you have your /2 in the wrong part of the equation.
Also note that a YOLO annotation will look like this:
0.642859 0.079219 0.148063 0.148062
The coordinates are relative to the size of the photo, ranging from 0 to 1. To normalize the coordinates, divide the x values by the photo width and the y values by the photo height.
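For completeness, a small sketch of that conversion (voc_to_yolo is just an illustrative name; x1, y1, x2, y2 are assumed to be pixel coordinates and img_w, img_h the image dimensions):

def voc_to_yolo(x1, y1, x2, y2, img_w, img_h):
    'Convert corner-format pixel coordinates to normalized YOLO (cx, cy, w, h)'
    w = x2 - x1
    h = y2 - y1
    cx = x1 + w / 2
    cy = y1 + h / 2
    # divide x values by the image width and y values by the image height
    return cx / img_w, cy / img_h, w / img_w, h / img_h

# e.g. a 100x50 box with its top-left corner at (200, 300) in a 640x480 image
print(voc_to_yolo(200, 300, 300, 350, 640, 480))  # ~(0.391, 0.677, 0.156, 0.104)

Keep in mind that an actual YOLOv5 label line also starts with the class index, so that still has to be prepended when writing the .txt files.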

How to crop features outside an image region using pytorch?

We can use ROI-Pool/ROI-Align to crop the sub-features inside an image region (which is a rectangle).
I was wondering how to crop features outside this region.
In other words, how do I set the values (of a feature map) inside a rectangular region to zero, while values outside the region remain unchanged?
I'm not sure that this idea of ROI Align is quite correct. ROI Pool and ROI Align are used to take a number of differently sized regions of interest identified in the original input space (i.e. pixel space) and output a set of same-sized feature crops from the features computed by (nominally) the convolutional network.
As perhaps a simple answer to your question, though: you simply need to create a mask tensor of ones with the same dimensions as your feature maps, set the values within the ROIs to zero in this mask, and then multiply the mask by the feature maps. This will suppress all values within the ROIs. Creating this mask should be fairly simple. I did it with a for-loop to avoid thinking, but there are likely more efficient ways as well.
import torch

# feature_maps: batch_size x num_feature_maps x width x height
batch_size, num_feature_maps = feature_maps.shape[:2]
mask = torch.ones(feature_maps.shape[2:])
for ROI in ROIs:  # assuming each ROI is [xmin, ymin, xmax, ymax]
    mask[ROI[0]:ROI[2], ROI[1]:ROI[3]] = 0
mask = mask.unsqueeze(0).unsqueeze(0)                   # 1 x 1 x width x height
mask = mask.repeat(batch_size, num_feature_maps, 1, 1)  # batch_size x num_feature_maps x width x height
output = torch.mul(mask, feature_maps)
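As a side note (my own addition, not part of the answer above): the explicit repeat isn't strictly needed, since broadcasting the 2-D mask over the batch and channel dimensions gives the same result; here mask_2d stands for the mask as it is right after the for-loop, before the unsqueeze/repeat.

# equivalent to the unsqueeze/repeat/multiply above, relying on broadcasting
output = feature_maps * mask_2d[None, None, :, :]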

Loss function for Bounding Box Regression using CNN

I am trying to understand Loss functions for Bounding Box Regression in CNNs. Currently I use Lasagne and Theano, which makes writing loss expressions very easy. Many sources propose different methods and I am asking myself which one is usually used in practice.
The bounding box coordinates are represented as normalized coordinates in the order [left, top, right, bottom] (using T.matrix('targets', dtype=theano.config.floatX)).
I have tried the following functions so far; however all of them have their drawbacks.
Intersection over Union
I was advised to use the Intersection over Union measure to identify how well the two bounding boxes align and overlap. However, a problem occurs when the boxes don't overlap: the intersection is then 0, and the whole quotient turns 0 regardless of how far apart the bounding boxes are. I implemented it as:
import theano.tensor as T

def get_area(A):
    return (A[:, 2] - A[:, 0]) * (A[:, 1] - A[:, 3])

def get_intersection(A, B):
    return (T.minimum(A[:, 2], B[:, 2]) - T.maximum(A[:, 0], B[:, 0])) \
         * (T.minimum(A[:, 1], B[:, 1]) - T.maximum(A[:, 3], B[:, 3]))

def bbox_overlap_loss(A, B):
    """Computes the bounding box overlap using the
    Intersection over Union"""
    intersection = get_intersection(A, B)
    union = get_area(A) + get_area(B) - intersection
    # Turn into loss
    l = 1.0 - intersection / union
    return l.mean()
Squared Diameter Difference
To create an error measure for non-overlapping bounding boxes, I tried computing the squared difference of the bounding box diameters. It seems to work, but I am almost sure there is a much better way to do this. I implemented it as:
def squared_diameter_loss(A, B):
    # Represent the squared distance from the real diameter
    # in normalized pixel coordinates
    l = (abs(A[:, 0:2] - B[:, 0:2]) + abs(A[:, 2:4] - B[:, 2:4])) ** 2
    return l.mean()
Euclidean Loss
The simplest function would be the Euclidean loss, which computes the square root of the squared differences of the bounding box parameters. However, it doesn't take the overlap of the bounding boxes into account, only the differences of the parameters left, right, top, and bottom. I implemented it as:
import lasagne

def euclidean_loss(A, B):
    l = lasagne.objectives.squared_error(A, B)
    return l.mean()
Could someone guide me on which would be the best loss function for bounding box regression in this use case, or point out if I am doing something wrong here? Which loss function is usually used in practice?
Speaking from personal implementation experience, I had much better results training a CNN with IoU as the loss function than with Euclidean (MSE or L2) loss. I have not used the squared diameter difference loss. In general, a loss function that explicitly represents the goodness of your outputs for the task you hope to accomplish is probably best.
With regard to the IoU having a value of zero, you can introduce an additional term in the formulation so that it gracefully trends towards 0, perhaps based on the normalized distance between box centers. This might have the additional effect of helping to center the bounding boxes relative to the ground truth.
This response is mostly conceptual, but I'd be happy to supply code examples if desired.
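Since code is offered on request, here is one possible sketch of that idea in PyTorch (the question uses Theano, but the structure carries over; boxes are assumed to be (N, 4) tensors in [x1, y1, x2, y2] order with x1 < x2 and y1 < y2, which mirrors the DIoU loss of Zheng et al., 2020):

import torch

def iou_with_center_penalty(A, B, eps=1e-7):
    'IoU loss plus a normalized center-distance term for non-overlapping boxes'
    inter_w = (torch.min(A[:, 2], B[:, 2]) - torch.max(A[:, 0], B[:, 0])).clamp(min=0)
    inter_h = (torch.min(A[:, 3], B[:, 3]) - torch.max(A[:, 1], B[:, 1])).clamp(min=0)
    inter = inter_w * inter_h
    area_a = (A[:, 2] - A[:, 0]) * (A[:, 3] - A[:, 1])
    area_b = (B[:, 2] - B[:, 0]) * (B[:, 3] - B[:, 1])
    iou = inter / (area_a + area_b - inter + eps)

    # squared distance between box centers, normalized by the squared diagonal
    # of the smallest box enclosing both; keeps the gradient informative when IoU = 0
    center_a = (A[:, :2] + A[:, 2:]) / 2
    center_b = (B[:, :2] + B[:, 2:]) / 2
    center_dist = ((center_a - center_b) ** 2).sum(dim=1)
    enc_w = torch.max(A[:, 2], B[:, 2]) - torch.min(A[:, 0], B[:, 0])
    enc_h = torch.max(A[:, 3], B[:, 3]) - torch.min(A[:, 1], B[:, 1])
    diag = enc_w ** 2 + enc_h ** 2

    return (1.0 - iou + center_dist / (diag + eps)).mean()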

Stage3D, AGAL - vertices' and textures' coordinate systems

I've been trying to work with more complicated shaders and have run into issues with the coordinate systems used by the vertex shader and the texture sampler. In short, they don't seem to make any sense, and when trying to test them I end up with inconsistent results. To make matters worse, the internet has little in the way of documentation, and most of the information I've found seems to assume I already know how this works. I was hoping someone could clarify the following:
1. The vertex shaders pass an (x, y, z) representing a location on the render target. What are acceptable values for x, y, and z?
2. How do x and y correspond to the width and height of the back buffer (assuming it is the render target)?
3. How do x and y correspond to the width and height of an output texture (assuming it is the render target)?
4. When x = 0 and y = 0, where does the vertex sit, location-wise?
5. The texture samplers sample a texture at a (u, v) coordinate. What are acceptable values for u and v?
6. How do u and v correspond to the width and height of the texture being sampled?
7. How do AGAL's wrap, clamp, and repeat flags alter sampling, and what is the default behavior when one isn't given?
8. When sampling at u = 0 and v = 0, which pixel is returned, location-wise?
EDIT:
From my tests, I believe the answers are:
1. Unsure
2. -1 is left/bottom, 1 is right/top
3. Unsure
4. At the center of the output
5. Unsure
6. 0 is left/bottom, 1 is right/top
7. Unsure
8. The far bottom-left of the texture
1. You normally use a coordinate system of your own and then multiply the position of each vertex by an MVP (model-view-projection) matrix to get NDC coordinates, which are the output of the vertex shader fed to the GPU. There is a nice article explaining all of that for Stage3D.
2. Correct. And z is in the range [0, 1].
3. Rendering to a render target is the same as rendering to the back buffer: you output NDC from your vertex shader, so the real size of the texture is irrelevant.
4. Yup, the center of the screen.
5. Normally it's [0, 1], but you can use values outside that range, and then the output depends on the texture wrap mode (like repeat or clamp) set on the sampler.
6. (0, 0) is left/top, (1, 1) is right/bottom.
7. The default is repeat. These modes decide what you get when you sample with a coordinate outside the range [0, 1]. With repeat, [1.5, 1.5] will result in [0.5, 0.5], while [1.0, 1.0] is the result if the mode is set to clamp (see the small numeric sketch after this list).
8. The top-left pixel of the texture.
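For point 7, a tiny illustration of the arithmetic (Python rather than AGAL, purely to show what the two modes do to an out-of-range coordinate):

def repeat_wrap(u):
    # keep only the fractional part: 1.5 -> 0.5, -0.25 -> 0.75
    return u % 1.0

def clamp_wrap(u):
    # pin out-of-range coordinates to the nearest edge
    return min(max(u, 0.0), 1.0)

print(repeat_wrap(1.5), clamp_wrap(1.5))  # 0.5 1.0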

Camera Calibration Matrix how to?

With this toolbox I was performing calibration of my camera.
However the toolbox outputs results in matrix form, and being a noob I don't really understand mathy stuff.
The matrix is in the following form, where R is a rotation matrix and T is a translation vector.
And these are the results I got from the toolbox (it outputs values in pixels):
-0.980755 -0.136184 -0.139905 217.653207
0.148552 -0.055504 -0.987346 995.948880
0.126695 -0.989128 0.074666 371.963957
0.000000 0.000000 0.000000 1.000000
Using this data, can I find out how much my camera is rotated and how far it is from the calibration object?
The distance part is easy. The translation from the origin is given by the first three numbers in the rightmost column; these are the translations in the x, y, and z directions respectively. In your example, the camera's position is p = (px, py, pz) = (217.653207, 995.948880, 371.963957). You can take the Euclidean distance between the camera's location and the location of the calibration object (cx, cy, cz), that is, sqrt((px-cx)^2 + (py-cy)^2 + (pz-cz)^2).
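Numerically, with the matrix above (cx, cy, cz are whatever coordinates your calibration object has; here it is simply assumed to sit at the origin):

import math

px, py, pz = 217.653207, 995.948880, 371.963957  # translation column from the matrix
cx, cy, cz = 0.0, 0.0, 0.0                       # assumed: calibration object at the origin
print(math.sqrt((px - cx) ** 2 + (py - cy) ** 2 + (pz - cz) ** 2))  # ~1085.2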
The more difficult part regards the rotation which is captured in the upper left 3x3 elements of the matrix. Without knowing exactly how they arrived at this, you're somewhat out of luck. That is, it's not easy to convert that back to Euler Angles, if that's what you want. However, you can transform those elements into a Quaternion Rotation which will give you the unique unit vector and angle to rotate the camera to that orientation. The specifics of the computation are provided here. Once you have the Quaternion rotation, you can easily apply it to the vectors n = (0, 0, 1), up = (0, 1, 0) and right = (1, 0, 0) to get the normal (direction the camera is pointed), up and right vectors. The right vector is only useful if you are interested in slewing the camera left or right from its current position.
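If you just want the numbers, one possible sketch using SciPy (scipy.spatial.transform.Rotation handles the matrix-to-quaternion conversion; the 3x3 block is copied from the matrix above):

import numpy as np
from scipy.spatial.transform import Rotation

R = np.array([[-0.980755, -0.136184, -0.139905],
              [ 0.148552, -0.055504, -0.987346],
              [ 0.126695, -0.989128,  0.074666]])

rot = Rotation.from_matrix(R)
print(rot.as_quat())                      # quaternion as (x, y, z, w)
print(rot.as_euler('xyz', degrees=True))  # Euler angles, if you prefer those
print(rot.apply([0, 0, 1]))               # the rotated n = (0, 0, 1) vector from the answer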
I'm guessing the code uses the 'standard' formulation; in that case you will find more details in the OpenCV library docs or their book.