I'm implementing this project and it is working fine. Now I wonder how is it possible that the training phase uses only a face crop of the image, but actual use can accept a full image with multiple people.
The model is trained to find a face within an image.
Training with face crops allows the training to converge faster, as it does not go through the trial-and-error to recognize -- and then learn to ignore -- other structures in the input images. The full capacity of the model topology can go toward facial features.
When you get to scoring ("actual use", a.k.a. inference), the model has no training for or against all the other stuff in each photo. It's trained to find faces, and will do that well.
Does that explain it well enough?
Related
I need to visually recognise some flat pictures showed to camera. There are not many of them (maybe 30) but discrimination may depend on details. The input may be partly obscured or shadowed and is suspect to lighting changes.
The samples need to be updatable.
There are many existing frameworks for object detection, with the most reliable ones depending on deep learning methods (mostly convolutional networks). However, the pretrained models are not well optimised to discern flat imagery of course, and even if I start training from scratch, updating the system for new samples would take a cumbersome training process, if I am right about how this works.
Is it possible to use deep learning while still keeping the sample pool flexible?
Is there any other well known reliable method to detect images from a small sample set?
One can use well trained networks for visual classification like Inception or SqueezeNet, slice of the last layer(s) and add a simple statistical algorithm (for example k-nearest neighbour) that can be directly teached by the samples in a non-iterative fashion.
Most classification-related calculations like lighting and orientation insensitivity are already handled by the pre-trained network then, while the network's output keep enough information to allow statistical algorithms decide the image class.
An implementation using k-nearest neighbour is shown here: https://teachablemachine.withgoogle.com/ , the source is hosted here: https://github.com/googlecreativelab/teachable-machine .
Use transfer learning; you’ll still need to build a training set, but you’ll get better results than starting with random weights. Try to find a model trained on images similar to yours. You might also do some black box testing of the selected model with your curated images to baseline it’s response curve to your images.
I'm studying Image Recognition with Tensorflow. I already read about the topic How to retrain Inception's Layer for new categories on Tensorflow.org, which utilize the Inception v3 training model.
Now, I desire to creat my own CNN model in order to compare with Inception v3, but I don't know how can I begin with.
Anyone knows some guides step-by-step on this problem?
I'd appreciate any your suggestion
Thanks in advance
First baby steps
The gold standard for getting started in image recognition is processing MNIST images. Tensorflow has a great tutorial on how to get started and also how to move to convolutional networks.
From there it is a long hard road to compete with Inception without just copying someone else's graph. You'll probably want to get a feel for what the different layers of convolution do. I created a basic Tensorflow Tutorial which contains an example python file that demos different convolution graphs and their resulting accuracy.
Going deeper
After conquering MNIST you'll need a lot of images (you can get them from imageNet) and a lot of GPU (to run all your training) and a software setup so that you can not only run and test your model, but dozens (if not hundreds) of variations to explore your hyper parameters (like learning rate, convolution size, dropout, etc). Remember, it took a team of leading edge Machine Learning experts to create something like Inception, many many months (possibly years) of iteration to find the model they use today, and thousands of CPU/GPU hours.
If you are trying to understand what is going on and what makes a good graph, then trying to recreate Inception is a great idea. If you just want an excellent Image recognition model, then reuse an existing one.
If you are trying to have fun, just do it!
Cheers-
We are running a huge team that process child photos for our customers, the team processes over 1M photos per year.
The process includes basic tuning of light, resize, apply some filters to make the skin looks better.
We want to use deep learning to complete the jobs as much as possible. Which means I want to choose one model and train that model using our existing data. And then use the trained model to generate photos by inputing the new unprocessed photos.
Is there existing model that I can make use of, or any papers have covered this scenario?
Any help would be appreciated, thanks!
You could try something like this: https://arxiv.org/pdf/1412.7725.pdf. But with deep learning and your amount of training data you can problem get any big enough model to work well.
Image generation is not what you should search for. Image generation means that an image is generated (almost) completely from nothing. You want to enhance an existing image.
Although I haven't read any papers about this scenario so far, searching for "image enhancement neural network" reveald several promising results:
A Survey on Image Enhancement Techniques: Classical Spatial Filter, Neural Network, Cellular Neural Network, and Fuzzy Filter: http://ieeexplore.ieee.org/document/4237993/
A new class of nonlinear filters for image enhancement: http://ieeexplore.ieee.org/document/150915/
An image enhancement technique combining sharpening and noise reduction: http://ieeexplore.ieee.org/document/1044761/
I guess you could do the following:
Create a CNN model. The only "special" thing of this model is that it does not have a fully connected layer as target, but another (3 channel) image. You have to adjust the error function to this. (Similar to semantic segmentation).
I have an aerial, and a satellite image and I'm trying to measure the similarity of both images, and get a factor of how similar they are. What should I look into? I checked fine-grained image similarity, but the problem is I don't know what will be the negative image since my two images are specific. So what should I read or check out?
Check Learning a Similarity Metric Discriminatively, with Application to Face Verification
A tutorial on how to implement the Siamese network described in the paper is here
I'm using joint positions from a Kinect camera as my state space but I think it's going to be too large (25 joints x 30 per second) to just feed into SARSA or Qlearning.
Right now I'm using the Kinect Gesture Builder program which uses Supervised Learning to associate user movement to specific gestures. But that requires supervised training which I'd like to move away from. I figure the algorithm might pick up certain associations between joints that I would when I classify the data myself (hands up, step left, step right, for example).
I think feeding that data into a deep neural network and then pass that into a reinforcement learning algorithm might give me a better result.
There was a paper on this recently. https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
I know Accord.net has both deep neural networks and RL but has anyone combined them together? Any insights?
If I understand correctly from your question + comment, what you want is to have an agent that performs discrete actions using a visual input (raw pixels from a camera). This looks exactly like what DeepMind guys recently did, extending the paper you mentioned. Have a look at this. It is the newer (and better) version of playing Atari games. They also provide an official implementation, which you can download here.
There is even an implementation in Neon which works pretty well.
Finally, if you want to use continuous actions, you might be interested in this very recent paper.
To recap: yes, somebody combined DNN + RL, it works and if you want to use raw camera data to train an agent with RL, this is definitely one way to go :)