Why does an object detection model predict the offset of the object? - deep-learning

I'm reading about the YOLO series of object detection models. I'm still confused about why the model always predicts an offset for the object rather than its coordinates. Can anyone explain this to me, and give an example of a case where the model directly predicts the 4 coordinates of the object?
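For reference, in the YOLOv2/v3 papers "predicting offsets" means the network outputs (tx, ty, tw, th), where the box center is a sigmoid offset inside the responsible grid cell and the size is a log-scale offset from an anchor box. A minimal sketch of that decoding (the grid position, anchor size and stride below are made-up numbers, purely for illustration):

```python
import math

def decode_yolo_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
    """Decode raw network outputs into an absolute box (YOLOv2/v3-style).

    (cx, cy)  - top-left corner of the responsible grid cell (in cell units)
    (pw, ph)  - anchor (prior) width/height in pixels
    stride    - size of one grid cell in pixels
    """
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = (sigmoid(tx) + cx) * stride      # center is constrained to its grid cell
    by = (sigmoid(ty) + cy) * stride
    bw = pw * math.exp(tw)                # size is a multiplicative offset of the anchor
    bh = ph * math.exp(th)
    return bx, by, bw, bh

# Illustrative numbers only: cell (7, 4) on a stride-32 grid, a 116x90 anchor.
print(decode_yolo_box(0.2, -0.1, 0.3, 0.05, cx=7, cy=4, pw=116, ph=90, stride=32))
```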

Related

Multi output Regression Model

I am working on airfoil simulation data. I trained a multi-output regression model using an encoder and decoder and got an MSE of 8.78. I need help deciding what model I should use instead. My main goal is for training to take less time.
Thank you.
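For context, a bare-bones encoder-decoder style multi-output regressor along these lines might look like the sketch below (the input/output dimensions and layer sizes are hypothetical, not the asker's actual setup):

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 40 simulation inputs, 8 regression targets.
IN_DIM, OUT_DIM = 40, 8

model = nn.Sequential(                  # encoder -> bottleneck -> decoder
    nn.Linear(IN_DIM, 64), nn.ReLU(),
    nn.Linear(64, 16), nn.ReLU(),       # bottleneck ("encoded" representation)
    nn.Linear(16, 64), nn.ReLU(),
    nn.Linear(64, OUT_DIM),             # one output per regression target
)

loss_fn = nn.MSELoss()                  # the MSE metric the question reports
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(32, IN_DIM), torch.randn(32, OUT_DIM)   # dummy batch
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```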

How to conduct object detection again inside bounding boxes after object detection?

For example, I want to use YOLOv5 to detect intestinal cells first and get green bounding boxes, and then detect intestinal bulges inside those green bounding boxes.
I want to import the bounding boxes from the first YOLOv5 model into the second YOLOv5 model, but how do I get the bounding boxes into the second YOLOv5 model?
Or do you have other good ideas?
I would train a new YOLOv5 model on both classes. By doing so, you would only need a single inference pass to get the results.
If that is not possible, I would input the clean image into the different models, gather the bounding boxes from the different models, and finally plot them on the original image. This way the previously drawn boxes do not affect inference negatively, since the models have almost certainly never seen drawn bounding boxes during training.
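A minimal sketch of that second suggestion, assuming both detectors are ordinary YOLOv5 checkpoints loadable through torch.hub (the weight file names here are hypothetical):

```python
import torch

# Hypothetical weight files; both models are assumed to be standard YOLOv5 exports.
cell_model  = torch.hub.load("ultralytics/yolov5", "custom", path="cells.pt")
bulge_model = torch.hub.load("ultralytics/yolov5", "custom", path="bulges.pt")

img = "intestine.jpg"                      # the clean, unannotated image
cell_boxes  = cell_model(img).xyxy[0]      # rows of (x1, y1, x2, y2, conf, class)
bulge_boxes = bulge_model(img).xyxy[0]

# Draw both sets of boxes on the original image afterwards (e.g. with OpenCV or
# matplotlib), so the drawn boxes never become part of any model's input.
```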

What are the ideal steps for using predicted segmentation masks in watershed post-processing?

I am experimenting with object segmentation (round-shaped objects that often occur close together). I used the U-Net deep neural network architecture for segmentation and obtained segmentation masks, which I saved in .npy format.
I am a beginner in this area. I would like to know the ideal steps to follow if I want to apply watershed to the predicted masks with the aim of separating the objects.
I guess I need to convert the predicted binary mask into some form from which I can obtain markers indicating the centroids.
Please help.
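A common recipe for exactly this, as a sketch assuming a 2-D binary mask saved as .npy and SciPy/scikit-image available: compute a distance transform, take its local maxima as markers (one per object), then run watershed on the negated distance map restricted to the mask.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

mask = np.load("pred_mask.npy") > 0.5            # binarise the U-Net output
distance = ndi.distance_transform_edt(mask)      # distance of each pixel to the background

# Local maxima of the distance map act as one marker per (roughly round) object.
# min_distance should be tuned to roughly the object radius in pixels.
coords = peak_local_max(distance, min_distance=10, labels=mask)
markers = np.zeros(mask.shape, dtype=int)
markers[tuple(coords.T)] = np.arange(1, len(coords) + 1)

# Flood from the markers over the inverted distance map, restricted to the mask.
labels = watershed(-distance, markers, mask=mask)   # one integer label per separated object
```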

6D pose estimation of a known 3D CAD object with limited model training for a new object

I'm working on a project where I need to estimate the 6DOF pose of a known 3D CAD object in a single RGB image - i.e. this task: https://paperswithcode.com/task/6d-pose-estimation. There are several constraints on the problem:
Usable commercially (licensed under BSD, MIT, BOOST, etc.), not GPL.
The CAD object is known and we do NOT aim for generality (i.e. recognize the class of all chairs).
The CAD object can be uploaded by a user, so it may have symmetries and a range of textures.
The inference step will run on a smartphone and should be able to run at >30 fps.
The inference step can either a) find the pose of the object once, after which I can write code to continue tracking it, or b) find the pose of the object continuously. I.e. the model doesn't need to have any continuous refinement steps after the initial pose estimate is found.
It can be anywhere on the scale from a single instance of a single object to multiple instances of multiple objects (MiMO). MiMO is preferred, but not required.
If a deep learning approach is used, the training time required for a new CAD object should be on the order of hours, not days.
It can either 1) just find the initial pose of an object with no refinement steps afterwards, or 2) find the initial pose of the object and also have refinement steps afterwards.
I am open to traditional approaches (i.e. finding 2D->3D correspondences and then solving with PnP), but it seems like deep learning approaches outperform them (the classical ones are too slow - see Real time 6D pose estimation of known 3D CAD objects from a single 2D image or point clouds from RGBD Camera when objects are one on top of the other?). Looking at deep learning approaches (PoseCNN, HybridPose, Pix2Pose, CosyPose), it seems most of them match these constraints, except that they require model training time. Perhaps I could use a single pre-trained model and then specialize it for each new CAD object with a shorter training step, but I am not sure of this, and I think success probably depends on the specific model chosen. For example, this project says it requires 3 hours of training time: https://github.com/DLR-RM/AugmentedAutoencoder.
So, my question: would somebody know what the state-of-the-art, commercially usable implementation is that doesn't require extensive training time for a new CAD object?
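For reference, the classical baseline mentioned above (2D->3D correspondences followed by PnP) boils down to a single OpenCV call once correspondences exist; everything in the sketch below (the matched points and the camera intrinsics) is a placeholder:

```python
import numpy as np
import cv2

# Hypothetical correspondences: points on the CAD model (object frame, e.g. mm)
# and their matched 2D detections in the image (pixels).
object_points = np.array([[0, 0, 0], [10, 0, 0], [0, 10, 0], [0, 0, 10],
                          [10, 10, 0], [10, 0, 10]], dtype=np.float64)
image_points = np.array([[320, 240], [400, 238], [322, 160], [318, 300],
                         [402, 158], [399, 302]], dtype=np.float64)

# Placeholder pinhole intrinsics (fx, fy, cx, cy) with no lens distortion.
K = np.array([[800, 0, 320],
              [0, 800, 240],
              [0, 0, 1]], dtype=np.float64)
dist = np.zeros(5)

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist,
                              flags=cv2.SOLVEPNP_ITERATIVE)
# rvec/tvec are the 6DOF pose: rotation as a Rodrigues vector plus translation.
```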

Object Detection from Image Classification

Can I use a model built for image classification to do object detection? I have already spent a lot of time collecting images and sorting each class into its own folder.
You can use your classification model as an initialized backbone for a detection model (e.g. Faster R-CNN), but it might not help that much compared to training your detector from scratch.
You will need to add detection layers (e.g. ROI pooling) to your backbone to perform detection.
While you can try unsupervised object detection, you will usually need extra labels, such as object bounding boxes, to train your object detector.
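A rough sketch of that idea with torchvision, assuming a MobileNetV2 classifier purely for illustration (swap in your own backbone, feature depth and class count):

```python
import torch
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.anchor_utils import AnchorGenerator

# Reuse a classification network's convolutional features as the detector backbone.
backbone = torchvision.models.mobilenet_v2(weights="DEFAULT").features
backbone.out_channels = 1280           # FasterRCNN needs to know the feature depth

# Detection layers added on top of the backbone: region proposals + ROI pooling.
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=["0"],
                                                output_size=7,
                                                sampling_ratio=2)

model = FasterRCNN(backbone,
                   num_classes=2,      # your object classes + 1 for background
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler)

# Wiring check on a dummy image; real training still needs bounding-box labels.
model.eval()
with torch.no_grad():
    predictions = model([torch.rand(3, 480, 640)])
print(predictions[0].keys())           # boxes, labels, scores
```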