Generate photos based on over 1M protos processed by ourselves before - deep-learning

We are running a huge team that process child photos for our customers, the team processes over 1M photos per year.
The process includes basic tuning of light, resize, apply some filters to make the skin looks better.
We want to use deep learning to complete the jobs as much as possible. Which means I want to choose one model and train that model using our existing data. And then use the trained model to generate photos by inputing the new unprocessed photos.
Is there existing model that I can make use of, or any papers have covered this scenario?
Any help would be appreciated, thanks!

You could try something like this: https://arxiv.org/pdf/1412.7725.pdf. But with deep learning and your amount of training data you can problem get any big enough model to work well.

Image generation is not what you should search for. Image generation means that an image is generated (almost) completely from nothing. You want to enhance an existing image.
Although I haven't read any papers about this scenario so far, searching for "image enhancement neural network" reveald several promising results:
A Survey on Image Enhancement Techniques: Classical Spatial Filter, Neural Network, Cellular Neural Network, and Fuzzy Filter: http://ieeexplore.ieee.org/document/4237993/
A new class of nonlinear filters for image enhancement: http://ieeexplore.ieee.org/document/150915/
An image enhancement technique combining sharpening and noise reduction: http://ieeexplore.ieee.org/document/1044761/
I guess you could do the following:
Create a CNN model. The only "special" thing of this model is that it does not have a fully connected layer as target, but another (3 channel) image. You have to adjust the error function to this. (Similar to semantic segmentation).

Related

Creating a dataset of images for object detection for extremely specific task

Even though I am quite familiar with the concepts of Machine Learning & Deep Learning, I never needed to create my own dataset before.
Now, for my thesis, I have to create my own dataset with images of an object that there are no datasets available on the internet(just assume that this is ground-truth).
I have limited computational power so I want to use YOLO, SSD or efficientdet.
Do I need to go over every single image I have in my dataset by my human eyes and create bounding box center coordinates and dimensions to log them with their labels?
Thanks
Yes, you will need to do that.
At the same time, though the task is niche, you could benefit from the concept of transfer learning. That is, you can use a pre-trained backbone in order to help your model to learn faster/achieve better results/need fewer annotations example, but you will still need to annotate the new dataset on your own.
You can use software such as LabelBox, as a starting point, it is very good since it allows you to output the format in Pascal(VOC) format, YOLO and COCO format, so it is a matter of choice/what is more suitable for you.

Semantic segmentation without labels in a single class

I am kind of new to semantic segmentation. I am trying to perform segmentation of images having defects.
I have the defect images annotated using a annotation tool and I created the mask for each image. I wanted to predict If an image has defect and where exactly it is located. But my problem is my defects does not look same in all the images. Example: Defects on steel- Steel breakage, erroded surface etc. I am just trying to classify if the image has defect or not and where it is located. So is it wrong to train the neural network with these all types considered as defects even though not everything lookalike?
I thought to do a binary segmentation of defect to no defect. If I am not correct how can I perform segmentation for defect and non defect images?
You first have to well define your problem and your objectives:
If you only want to detect if your image has a defect or not, it's a binary classification problem and you affect a label (0 or 1) to each image.
If you want to localise the defect approximatively (like a bounding box), it's an object detection problem and it can be realised with one or more classes.
If you want to localise precisely the defect (in order to performe measures for instance) the best is semantic segmentation or instance segmentation.
If you want to classify the defect, you will need to create classes for each defect you want to classify.
There is no magical solution because it depends of the objectives of your project. I can give you the following advices because I made an internship on a similar project :
Look carefully at your data, if you have thousands of images it will take a long to create your semantic segmentation dataset. Be smarter by using data augmentation techniques.
If you want to classify the defects, be sure to have enough defects of each type to train your network. If your network only sees one defect type per epoch, it can't learn to detect it.
Be sure that your network can detect the defects you're providing (not a scratch of two pixels for instance or alignement defects).
Performing semantic segmentation to only knows if there is a defect or not seems overkill because it's a long and complex process (rebuilding the image, memory of intermediaries images in Unet, lot of computations). If you really want to apply this method, you may create a threshold to detect if the number of detected pixels as defect allows to classify the image as 'presenting a defect' or not.
One class should be enough for your use-case. If you want to be able to distinguish between different types of defects though, you could try creating attributes for that class. So the class would be if a pixel has a defect or not, and the attribute would be breakage, eroded pixel, etc. Then you could train a model to detect a crack on the semantic class and another one to identify which type of defect it is.
Make sure to use an annotation tool that supports creating attributes. Personally, I use hasty.ai as their automation assistants are great! But I guess most tools should be able to do so.

Camera image recognition with small sample set

I need to visually recognise some flat pictures showed to camera. There are not many of them (maybe 30) but discrimination may depend on details. The input may be partly obscured or shadowed and is suspect to lighting changes.
The samples need to be updatable.
There are many existing frameworks for object detection, with the most reliable ones depending on deep learning methods (mostly convolutional networks). However, the pretrained models are not well optimised to discern flat imagery of course, and even if I start training from scratch, updating the system for new samples would take a cumbersome training process, if I am right about how this works.
Is it possible to use deep learning while still keeping the sample pool flexible?
Is there any other well known reliable method to detect images from a small sample set?
One can use well trained networks for visual classification like Inception or SqueezeNet, slice of the last layer(s) and add a simple statistical algorithm (for example k-nearest neighbour) that can be directly teached by the samples in a non-iterative fashion.
Most classification-related calculations like lighting and orientation insensitivity are already handled by the pre-trained network then, while the network's output keep enough information to allow statistical algorithms decide the image class.
An implementation using k-nearest neighbour is shown here: https://teachablemachine.withgoogle.com/ , the source is hosted here: https://github.com/googlecreativelab/teachable-machine .
Use transfer learning; you’ll still need to build a training set, but you’ll get better results than starting with random weights. Try to find a model trained on images similar to yours. You might also do some black box testing of the selected model with your curated images to baseline it’s response curve to your images.

Caffe training uses face crop but deploy uses full image

I'm implementing this project and it is working fine. Now I wonder how is it possible that the training phase uses only a face crop of the image, but actual use can accept a full image with multiple people.
The model is trained to find a face within an image.
Training with face crops allows the training to converge faster, as it does not go through the trial-and-error to recognize -- and then learn to ignore -- other structures in the input images. The full capacity of the model topology can go toward facial features.
When you get to scoring ("actual use", a.k.a. inference), the model has no training for or against all the other stuff in each photo. It's trained to find faces, and will do that well.
Does that explain it well enough?

Deep Neural Network combined with qlearning

I'm using joint positions from a Kinect camera as my state space but I think it's going to be too large (25 joints x 30 per second) to just feed into SARSA or Qlearning.
Right now I'm using the Kinect Gesture Builder program which uses Supervised Learning to associate user movement to specific gestures. But that requires supervised training which I'd like to move away from. I figure the algorithm might pick up certain associations between joints that I would when I classify the data myself (hands up, step left, step right, for example).
I think feeding that data into a deep neural network and then pass that into a reinforcement learning algorithm might give me a better result.
There was a paper on this recently. https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
I know Accord.net has both deep neural networks and RL but has anyone combined them together? Any insights?
If I understand correctly from your question + comment, what you want is to have an agent that performs discrete actions using a visual input (raw pixels from a camera). This looks exactly like what DeepMind guys recently did, extending the paper you mentioned. Have a look at this. It is the newer (and better) version of playing Atari games. They also provide an official implementation, which you can download here.
There is even an implementation in Neon which works pretty well.
Finally, if you want to use continuous actions, you might be interested in this very recent paper.
To recap: yes, somebody combined DNN + RL, it works and if you want to use raw camera data to train an agent with RL, this is definitely one way to go :)