Does detectron2 normalise the bounding box? - deep-learning

I know that detectron2 performs normalization on the training and inference images, but does it normalise the bounding boxes as well? Also, for a custom dataset, do we need to pass our own values for PIXEL_MEAN and PIXEL_STD? Any help would be very much appreciated.
I am expecting that the config file should have some option for normalising bounding boxes, but I'm not sure!
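As a point of reference, here is a minimal sketch of where the pixel-normalization keys live in a detectron2 config; the model zoo config name is just an example, and the override values at the end are placeholders rather than recommended numbers.

import detectron2
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
print(cfg.MODEL.PIXEL_MEAN, cfg.MODEL.PIXEL_STD)  # per-channel values the model subtracts / divides by

# For a custom dataset you could override them with your own statistics
# (placeholder numbers, not recommended values):
cfg.MODEL.PIXEL_MEAN = [123.675, 116.280, 103.530]
cfg.MODEL.PIXEL_STD = [58.395, 57.120, 57.375]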

Related

U-net how to understand the cropped output

I'm looking for a U-net implementation for a landmark detection task, where the architecture is intended to be similar to the one in the figure above. For reference, please see: An Attention-Guided Deep Regression Model for Landmark Detection in Cephalograms.
From the figure, we can see the input dimension is 572x572 but the output dimension is 388x388. My question is, how do we visualize and correctly understand the cropped output? From what I know, we ideally expect the output size to be the same as the input size (572x572) so we can apply the mask to the original image to carry out segmentation. However, some tutorials (like this one) recreate the model from scratch and use "same" padding to get around this, but I would prefer not to use same padding to achieve the same output size.
I can't use same padding because I chose a pretrained ResNet34 as my encoder backbone, and PyTorch's pretrained ResNet34 implementation doesn't use same padding in the encoder, which means the result is exactly what you see in the figure above (intermediate feature maps are cropped before being copied). If I continue building the decoder this way, the output will be smaller than the input image.
The question is: if I want to use the output segmentation maps, should I pad them on the outside until their dimensions match the input, or should I just resize them? I'm worried that the first option will lose information at the image boundary and the latter will dilate the landmark predictions. Is there a best practice for this?
The reason I must use a pretrained network is that my dataset is small (only 100 images), so I want to make sure the encoder can generate good enough feature maps from what it learned on ImageNet.
After some thinking and testing of my program, I found that PyTorch's pretrained ResNet34 doesn't lose spatial size because of the convolutions; its implementation does in fact use same padding. An illustration:
Input(3,512,512) -> Layer1(64,128,128) -> Layer2(128,64,64) -> Layer3(256,32,32) -> Layer4(512,16,16)
so we can use deconvolution (ConvTranspose2d in PyTorch) to bring the spatial dimension back up to 128, then upsample the result by a factor of 4 to get the segmentation mask (or landmark heatmaps).
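A minimal shape check of that claim (a sketch only; the decoder channel counts and the final 4x upsample are illustrative choices, not taken from the paper):

import torch
import torch.nn as nn
from torchvision.models import resnet34

# "weights=" assumes torchvision >= 0.13; older versions use pretrained=True
encoder = nn.Sequential(*list(resnet34(weights="IMAGENET1K_V1").children())[:-2])  # drop avgpool/fc
x = torch.randn(1, 3, 512, 512)
feat = encoder(x)
print(feat.shape)  # torch.Size([1, 512, 16, 16]) -- size shrinks only through striding, not padding loss

# Illustrative decoder: transposed convs back up to 128x128, then a 4x upsample to 512x512.
decoder = nn.Sequential(
    nn.ConvTranspose2d(512, 256, kernel_size=2, stride=2),  # 16 -> 32
    nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2),  # 32 -> 64
    nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2),   # 64 -> 128
    nn.Conv2d(64, 1, kernel_size=1),                        # one heatmap channel
)
heatmap = nn.functional.interpolate(decoder(feat), scale_factor=4, mode="bilinear", align_corners=False)
print(heatmap.shape)  # torch.Size([1, 1, 512, 512]) -- same spatial size as the input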

Bounding box creator

Friends, I have more than 100 images with their ground truths (black-and-white masks). How can I automatically generate bounding boxes from them in Pascal VOC format, i.e. XML files with the bounding box values?
I mean creating the xmin, xmax, ymin, ymax values from the masks and saving them as XML files. I used LabelImg, but I could not find an automatic way to do this. I will use them for deep learning with Pascal VOC.
Is there a code, tool, or link that shows how to do this?
If you want to get bounding boxes from masks, you just need to use numpy.where() to get the indices of each mask's foreground pixels; then take the min and max of those indices, and those are exactly the coordinates of the bounding box.
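A rough sketch of that idea, assuming one object per binary mask; the "masks" folder and the "object" class label are placeholders:

import os
import xml.etree.ElementTree as ET
import numpy as np
from PIL import Image

def mask_to_bbox(mask):
    # indices of all foreground pixels, then take their extremes
    ys, xs = np.where(mask > 0)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def write_voc_xml(image_name, width, height, bbox, out_path, label="object"):
    xmin, ymin, xmax, ymax = bbox
    ann = ET.Element("annotation")
    ET.SubElement(ann, "filename").text = image_name
    size = ET.SubElement(ann, "size")
    for tag, value in (("width", width), ("height", height), ("depth", 3)):
        ET.SubElement(size, tag).text = str(value)
    obj = ET.SubElement(ann, "object")
    ET.SubElement(obj, "name").text = label
    box = ET.SubElement(obj, "bndbox")
    for tag, value in (("xmin", xmin), ("ymin", ymin), ("xmax", xmax), ("ymax", ymax)):
        ET.SubElement(box, tag).text = str(value)
    ET.ElementTree(ann).write(out_path)

for fname in os.listdir("masks"):  # hypothetical folder of mask images
    mask = np.array(Image.open(os.path.join("masks", fname)).convert("L"))
    h, w = mask.shape
    write_voc_xml(fname, w, h, mask_to_bbox(mask), os.path.splitext(fname)[0] + ".xml")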

Can we find a required string in an image using a CNN/LSTM, or do we need to apply NLP after extracting the text with the CNN/LSTM? Can someone please clarify?

I'm building a parsing algorithm for images. Tesseract is not giving good accuracy, so I'm thinking of building a CNN+LSTM-based model for image-to-text conversion. Is my approach the right one? Can we extract only the required string directly from the CNN+LSTM model instead of using NLP? Or do you see any other ways to improve Tesseract's accuracy?
NLP is used to allow the network to try and "understand" text. I think what you want here is to see if a picture contains text. For this, NLP would not be required, since you are not trying to get the network to analyze or understand the text. Instead, this should be more of an object detection type problem.
There are many models that do object detection.
Some off the top of my head are YOLO, R-CNN, and Mask R-CNN.
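For example, a quick sketch of running torchvision's off-the-shelf Faster R-CNN (as a stand-in for any of those detectors) on an image; the image path and score threshold are placeholders, and for text regions specifically you would normally fine-tune on a text-detection dataset:

import torch
from PIL import Image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()  # COCO-pretrained detector
img = to_tensor(Image.open("document.png").convert("RGB"))
with torch.no_grad():
    pred = model([img])[0]  # dict with "boxes", "labels", "scores"
keep = pred["scores"] > 0.5
print(pred["boxes"][keep])  # candidate regions; fine-tune on text data to get text boxes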

Question about training an object detection model when the training images have many objects with missing annotations

In the training dataset, some objects that should be labeled are not, for various reasons.
Like in the picture below, some objects are missing annotations (the red rectangle marks the labeled one).
image with missing annotations
What should I do with this incompletely labeled dataset, and what effect does it have on the model (maybe it overfits to the test data because the missing labels act as false negatives during training)?
Most detection algorithms use portions of images without bounding boxes as "negative" examples, meaning regions that should not be detected.
If you have many objects in your training set which should have been labeled but aren't, this is a problem because it confuses the training algorithm.
You should definitely think about manually adding the missing labels to the dataset.

How to convert an image of handwriting into pen coordinates?

I have an image with binary values (black and white) at each pixel. I want to convert this into an ordered list of pen coordinates (X, Y) that traces the path of the pen. I want to do this so that I can use an API which only takes pen coordinates as input.
Is there any library or straightforward way to do this? Thanks!
This question https://graphicdesign.stackexchange.com/questions/25165/how-can-i-convert-a-jpg-signature-into-strokes describes using a vector graphics converter to do this. It suggests first converting the pixels to binary values and then using the autotrace tool.
autotrace -centerline -color-count 2 -output-file output.svg -output-format SVG input.png
I'm not sure what the best way to get SVG files into (X, Y) coordinates is; you can do a rough job by parsing the XML directly.
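For instance, a rough sketch using the svgpathtools library (an assumption on my part, not mentioned in the linked answer) to sample the autotrace output into (X, Y) points; the 50-points-per-path sampling density is arbitrary:

import numpy as np
from svgpathtools import svg2paths

paths, _ = svg2paths("output.svg")  # the SVG produced by the autotrace command above
strokes = []
for path in paths:
    # sample each vector path at evenly spaced parameter values
    pts = [path.point(t) for t in np.linspace(0.0, 1.0, 50)]  # complex numbers x + y*i
    strokes.append([(p.real, p.imag) for p in pts])
print(strokes[0][:5])  # first few pen coordinates of the first stroke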
Another approach could be to look through the submissions in the Kaggle competition for this same problem: https://www.kaggle.com/c/icdar2013-stroke-recovery-from-offline-data . I haven't looked into these myself, though I imagine they'd result in better performance.