Computer Vision - Preprocessing Recommended Segmentation Method - deep-learning

Given an image, the goal is to extract only the target cattle and remove every other object (humans or other cattle in view). The extracted images will form the training dataset for a CNN script that predicts the cattle's weight.
I tried segmentation with pixellib and the deeplabv3_xception_tf_dim_ordering_tf_kernels pretrained model, but I sometimes get fragmented results.
Can anyone suggest better segmentation methods or related links?

Related

Image Classification on heavy occluded and background camouflage

I am doing an image classification project on classifying various species of bamboo.
The datasets on Kaggle are mostly well labeled, with a single, clearly framed subject per picture.
The issue with bamboo is that it appears in clusters in most images, sometimes with more than one species. There is also a lot of heavy occlusion and background camouflage.
Besides, there is not much training data available for this problem.
So I have been making my own dataset by collecting images from the internet and taking photos with my DSLR.
My first approach was to use a weighted Mask R-CNN for instance segmentation and then classify the segments using VGGNet and GoogLeNet.
My next approach is to test Attention U-Net, YOLOv3 and a new paper, BCNet from ICLR 2021.
I will then classify with ResNeXt, GoogLeNet and SENet and compare the results.
Any tips or better approaches would be much appreciated.

Object detection from synthetic to real life data with Yolov5

I am currently trying yolov5 with custom synthetic data. The dataset we've created consists of 8 different objects. Each object has a minimum of 1500 pictures/labels, split 500/500/500 between normal, fog, and distractors around the object. Sample images from the dataset are in the first imgur link. The model is not trained from scratch, but from the standard yolov5 .pt weights.
So far we've tried:
Adding more data (from 300 images per object to 4500)
Creating more complex data (distractors on/around objects)
Running multiple training runs
Training with network sizes small, medium, large and xlarge
Using different batch sizes between 4 and 32 (depending on model size)
Everything so far has resulted in good to great detection on synthetic data, but results that are completely off on real-life data.
Examples: the model thinks that a whole picture of unrelated objects is a paper box, that walls are pallets, etc. Quick sample images are in the last imgur link.
Does anyone have clues on how to improve the training or the data so that it is better suited for real-life detection? Or on how to better interpret the results? I don't understand how the model concludes that a whole picture of unrelated objects is a box or pallet.
Results from training uploaded to imgur:
https://imgur.com/a/P0TQeBl
Example on real life data:
https://imgur.com/a/SGY7w8w
There are a couple of things you can do to improve the results.
After training your model with synthetic data, fine-tune it on real training data with a smaller learning rate (maybe 1/10th). This will reduce the gap between synthetic and real-life images. In some cases, rather than fine-tuning, training the model on mixed (synthetic + real) data produces better results.
Generate images that are structurally similar to real-life examples. For example, put humans inside forklifts, or pallets or barrels on forks, etc. Models learn from this.
Randomize the textures on the items that you want to detect. Models tend to focus on textures for detection. By randomizing textures with lots of variability, including non-natural occurrences, you force the model to learn to identify objects by something other than their texture. Although the texture of an object is sometimes a good identifier, synthetic data does not replicate that feature well enough, hence the domain gap, so you want to reduce its impact on the model's decision.
I am not sure whether the screenshot accurately represents your data generation distribution. If it does, you have to randomize the angles, sizes and occlusion amounts of the objects more.
As distractors, use objects that you don't want to detect but that will appear in the images you run inference on, rather than simple shapes like spheres.
Randomize the lighting more: intensity, color, angles, etc.
Increase background and ground randomization. Use HDRIs; there are lots of free ones.
Balance your dataset.
https://imgur.com/a/LdCa8aO
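As a rough illustration of the randomization points above, here is a minimal sketch of a per-image scene sampler. Everything in it is an assumption for illustration: the `SceneParams` fields, the value ranges and the pool sizes are hypothetical and would need to be mapped onto whatever renderer generates your data.

```python
import random
from dataclasses import dataclass


@dataclass
class SceneParams:
    """Hypothetical per-image settings handed to a synthetic-data renderer."""
    yaw_deg: float              # rotation of the object around the vertical axis
    pitch_deg: float            # tilt of the object
    scale: float                # relative size of the object in the frame
    occlusion: float            # fraction of the object hidden by distractors
    light_intensity: float      # relative brightness of the main light
    light_color: tuple          # RGB multipliers for the light color
    light_elevation_deg: float  # angle of the light above the horizon
    texture_id: int             # index into a pool of object textures
    background_id: int          # index into a pool of HDRI backgrounds


def sample_scene(rng, n_textures=200, n_backgrounds=50):
    """Draw one random scene configuration; all ranges are illustrative guesses."""
    return SceneParams(
        yaw_deg=rng.uniform(0.0, 360.0),
        pitch_deg=rng.uniform(-30.0, 30.0),
        scale=rng.uniform(0.2, 1.0),
        occlusion=rng.uniform(0.0, 0.6),
        light_intensity=rng.uniform(0.3, 2.0),
        light_color=tuple(rng.uniform(0.7, 1.0) for _ in range(3)),
        light_elevation_deg=rng.uniform(5.0, 85.0),
        texture_id=rng.randrange(n_textures),
        background_id=rng.randrange(n_backgrounds),
    )


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a dataset can be regenerated
    for _ in range(3):
        print(sample_scene(rng))
```

Sampling every parameter independently for each image is the simplest choice. If real scenes have correlated properties (for example, indoor backgrounds with dim lighting), the sampler would need to reflect that.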
Looking at your results, the answer is that your synthetic data is way too dissimilar to the real-life data you want it to work on. Generating synthetic scenes that are closer to their real-life counterparts and training again would clearly improve your results. That includes more realistic backgrounds and scene compositions. I don't know if your training set resembles the validation images you shared here, but in case it does, try to have more objects per image, place them closer to the camera, and add variation to their relative positions. Having just one random 3D object in the middle of an image is not going to provide good results. By the way, you are already overfitting your models, so more training images wouldn't help at this point.

How to combine the probability (soft) output of different networks and get the hard output?

I have trained three different models separately in Caffe, and for each one I can get the per-class probabilities for semantic segmentation. I want to produce a single output from the three sets of probabilities (for example, the argmax of the combined probabilities). Inference for each network can be done with its trained model and deploy.prototxt file. The hard output, i.e. the final segmentation, would then be derived from the combined soft output.
My questions are:
1. How do I get the ensemble output of these networks?
2. How do I train an ensemble of three networks end-to-end? Are there any resources that could help?
3. How do I get the final segmentation from the final probability (e.g., the argmax of the three probabilities), which is a soft output?
My questions may sound very basic, and my apologies for that. I am still trying to learn step by step. I really appreciate your help.
There are at least two ways that I know of to solve (1):
One is to use the pycaffe interface: instantiate the three networks, forward an input image through each of them, fetch the outputs and perform any operation you like to combine all three probabilities. This is especially useful if you intend to combine them using more complex logic.
The alternative (way less elegant) is to use caffe test and process all your inputs separately through each network, saving the probabilities to files. Then combine the probabilities from the files later.
Regarding your second question, I have never trained more than two weight-sharing CNNs (siamese networks). From what I understood, your networks don't share weights, only the architecture. If you want to train all three end-to-end, please take a look at this tutorial made for siamese networks. The authors define both paths/branches in their prototxt, connect each branch's layers to the input Data layer and join the branches at the end with a loss layer.
In your case you would define the three branches (one for each of your networks), connect them to input data layers (check whether each branch processes the same input or different inputs, for example the same image pre-processed differently) and unite them with a loss, similarly to the tutorial.
Now, for the last question, it seems Caffe has an ArgMax layer that may be what you are looking for. If you are familiar with Python, you could also use a Python layer, which lets you define with great flexibility how to combine the output probabilities.
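To make the combination step concrete, here is a minimal NumPy sketch that averages the class probabilities of several networks and takes the per-pixel argmax. It assumes each network's output has already been fetched as an array of shape (classes, height, width), for instance after a pycaffe forward pass. The function name and the equal default weighting are illustrative choices, and averaging is only one possible combination rule.

```python
import numpy as np


def ensemble_segmentation(prob_maps, weights=None):
    """Combine per-class probability maps from several networks.

    prob_maps: list of arrays, each of shape (C, H, W), one per network.
    weights:   optional per-network weights; equal weighting if None.
    Returns (soft, hard): the averaged probabilities of shape (C, H, W)
    and the per-pixel class index map of shape (H, W).
    """
    stacked = np.stack(prob_maps, axis=0)                # (N, C, H, W)
    soft = np.average(stacked, axis=0, weights=weights)  # (C, H, W)
    hard = soft.argmax(axis=0)                           # (H, W)
    return soft, hard


if __name__ == "__main__":
    # Toy stand-ins for three network outputs: 2 classes, a 1 x 2 image.
    p1 = np.array([[[0.9, 0.2]], [[0.1, 0.8]]])
    p2 = np.array([[[0.6, 0.6]], [[0.4, 0.4]]])
    p3 = np.array([[[0.3, 0.1]], [[0.7, 0.9]]])
    soft, hard = ensemble_segmentation([p1, p2, p3])
    print(soft)
    print(hard)
```

Passing `weights=[0.5, 0.25, 0.25]`, for example, would give the first network twice the influence of each of the others.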

Many challenges to obtain semantic segmentation results for a long time

I had no choice except to ask here. I have been struggling with this for a long time and have not been able to get any meaningful output from FCN32 :(
I trained FCN32 on my data from scratch and I always get a black image. I added Gaussian initialization with std = 0.01 for the convolutional layers, but I still get a black image.
I tried to add a weighted loss layer. However, I did not manage to add it correctly. I am not good at Python or C++.
My questions:
Is there a working PR that makes it easy to include this layer?
My data has 5 classes whose proportions differ from image to image. How can I create the weight matrices for each image?
I would really appreciate any help. Please share any resource or link you know of, or tell me if I can get this from other networks' repositories.
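For the per-image weights, one common option (an assumption here, not something tied to a specific Caffe PR or layer) is to weight each class by its inverse pixel frequency and then build a weight map by looking up each pixel's class. A minimal NumPy sketch with hypothetical function names:

```python
import numpy as np


def inverse_frequency_weights(label_maps, num_classes):
    """Per-class weights proportional to 1 / pixel frequency.

    label_maps: iterable of integer arrays holding class ids in [0, num_classes).
    Classes that never occur get weight 0. The weights of the classes that do
    occur are rescaled so that their mean is 1.
    """
    counts = np.zeros(num_classes, dtype=np.float64)
    for labels in label_maps:
        counts += np.bincount(labels.ravel(), minlength=num_classes)[:num_classes]
    freq = counts / counts.sum()
    weights = np.zeros(num_classes, dtype=np.float64)
    present = freq > 0
    weights[present] = 1.0 / freq[present]
    weights[present] /= weights[present].mean()
    return weights


def weight_map(labels, weights):
    """Look up the class weight of every pixel; same shape as `labels`."""
    return weights[labels]


if __name__ == "__main__":
    labels = np.array([[0, 0, 0, 1]])  # toy 1 x 4 label map with 2 classes
    w = inverse_frequency_weights([labels], num_classes=2)
    print(w)
    print(weight_map(labels, w))
```

Passing a single label map gives weights specific to that image, while passing the whole training set gives dataset-level weights. Which of the two works better for a weighted loss is something to check empirically.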

Training model to recognize one specific object (or scene)

I am trying to train a learning model to recognize one specific scene. For example, say I would like to train it to recognize pictures taken at an amusement park and I already have 10 thousand pictures taken at an amusement park. I would like to train this model with those pictures so that it would be able to give a score for other pictures of the probability that they were taken at an amusement park. How do I do that?
Considering this is an image recognition problem, I would probably use a convolutional neural network, but I am not quite sure how to train it in this case.
Thanks!
There are several possible ways. The most trivial one is to collect a large number of negative examples (images from other places) and train a two-class model.
The second approach would be to train a network to extract meaningful low-dimensional representations (embeddings) from an input image. Here you can use siamese training to explicitly train the network to learn similarities between images. Such an approach is employed for face recognition, for instance (see FaceNet). With such embeddings, you can use well-established outlier detection methods, for instance a one-class SVM, or any other classifier. In this case you also need negative examples.
I would heavily augment your data using image cropping - it is the most obvious way to increase the amount of training data in your case.
In general, your success in this task strongly depends on the task statement (are you restricted to parks only, or to any kind of place?) and on having proper data.
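The cropping augmentation suggested above can be sketched as follows. This is a minimal NumPy version with a hypothetical `random_crops` helper. The crop size and the number of crops per image are arbitrary choices for illustration, and images are assumed to be arrays of shape (height, width, channels).

```python
import numpy as np


def random_crops(image, crop_size, n_crops, rng):
    """Return `n_crops` random crops of size `crop_size` = (height, width)."""
    height, width = image.shape[:2]
    crop_h, crop_w = crop_size
    if crop_h > height or crop_w > width:
        raise ValueError("crop size must not exceed the image size")
    crops = []
    for _ in range(n_crops):
        # The upper bound of rng.integers is exclusive, hence the +1.
        top = rng.integers(0, height - crop_h + 1)
        left = rng.integers(0, width - crop_w + 1)
        crops.append(image[top:top + crop_h, left:left + crop_w].copy())
    return crops


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = np.zeros((240, 320, 3), dtype=np.uint8)  # stand-in for a real photo
    crops = random_crops(image, crop_size=(224, 224), n_crops=5, rng=rng)
    print(len(crops), crops[0].shape)
```

Each crop can then be fed to the network as a separate training example with the same label as the source image.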