How to creat CNN model in Image Recognition with Tensorflow to compare with Inception v3 - deep-learning

I'm studying Image Recognition with Tensorflow. I already read about the topic How to retrain Inception's Layer for new categories on Tensorflow.org, which utilize the Inception v3 training model.
Now, I desire to creat my own CNN model in order to compare with Inception v3, but I don't know how can I begin with.
Anyone knows some guides step-by-step on this problem?
I'd appreciate any your suggestion
Thanks in advance

First baby steps
The gold standard for getting started in image recognition is processing MNIST images. Tensorflow has a great tutorial on how to get started and also how to move to convolutional networks.
From there it is a long hard road to compete with Inception without just copying someone else's graph. You'll probably want to get a feel for what the different layers of convolution do. I created a basic Tensorflow Tutorial which contains an example python file that demos different convolution graphs and their resulting accuracy.
Going deeper
After conquering MNIST you'll need a lot of images (you can get them from imageNet) and a lot of GPU (to run all your training) and a software setup so that you can not only run and test your model, but dozens (if not hundreds) of variations to explore your hyper parameters (like learning rate, convolution size, dropout, etc). Remember, it took a team of leading edge Machine Learning experts to create something like Inception, many many months (possibly years) of iteration to find the model they use today, and thousands of CPU/GPU hours.
If you are trying to understand what is going on and what makes a good graph, then trying to recreate Inception is a great idea. If you just want an excellent Image recognition model, then reuse an existing one.
If you are trying to have fun, just do it!
Cheers-

Related

Is there an open source solution for Multiple camera multiple object (people) tracking system?

I have been trying to tackle a problem where I need to track multiple people through multiple camera viewpoints on a real-time basis.
I found a solution DeepCC (https://github.com/daiwc/DeepCC) on DukeMTMC dataset but unfortunately, this solution has been taken down because of data confidentiality issues. They were using Fast R-CNN for object detection, triplet loss for Re-identification and DeepSort for real-time multiple object tracking.
Questions:
1. Can someone share some other resources regarding the same problem?
2. Is there a way to download and still use the DukeMTMC database for multiple tracking problem?
3. Is anyone aware when the official website (http://vision.cs.duke.edu/DukeMTMC/) will be available again?
Please feel free to provide different variations of the question :)
Intel OpenVINO framewors has all part of this task:
Objects detection with pretrained Faster RCNN, SSD or YOLO.
Reidentification models.
And complete demo application.
And you can use another models. Or if you want to use detection on GPU then take opencv_dnn_cuda for detection and OpenVINO for reidentification.
A good deep learning library that I have used in the past for my work is called Mask R-CNN, or Mask Regions-Convolutional Neural-Network. Although I have only used this algorithm on images and not on videos, the same principles apply, and it's very easy to make the transition to detection objects in a video. The algorithm uses Tensorflow and Keras, where you can split your input data, i.e images of people, into two sets, training, and validation.
For training, use a third party software like via, to annotate the people in the images. After the annotations have been drawn, you will export a JSON file with all annotations drawn, which will be used for the training process. Do the same thing for the validation phase, BUT make sure the images in the validation have not been seen before by the algorithm.
Once you have annotated both groups and generated JSON files, you then can start training the algorithm. Mask R-CNN makes it very easy to train, with all you need to do is pass one line full of commands to start it. If you want to train data on your GPU instead of your CPU, then install Nvidia's CUDA, which works very well with supported GPUs, and requires no coding after the installation.
During the training stage, you will be generating weights files, which are stored in the .h5 format. Depending on the number of epochs you choose, there will be a weights file generated per epoch. Once the training has finished, you then will just have to reference that weights file anytime you want to detect relevant objects, i.e. in your video feed.
Some important info:
Mask R-CNN is somewhat of an older algorithm, but it still works flawlessly today. Although some people have updated the algorithm to Tenserflow 2.0+, to get the best use out of it, use the following.
Tensorflow-gpu 1.13.2+
Keras 2.0.0+
CUDA 9.0 to 10.0
Honestly, the hardest part for me in the past was not using the algorithm, but finding the right versions of Tensorflow, Keras, and CUDA, that all play well with each other, and don't error out. Although the above-mentioned versions will work, try and see if you can upgrade or downgrade certain libraries to see if you can get better results.
Article about Mask R-CNN with video, I find it to be very useful and resourceful.
https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/
The GitHub repo can be found below.
https://github.com/matterport/Mask_RCNN
EDIT
You can use this method across multiple cameras, just set up multiple video captures within a computer vision library like OpenCV. I assume this would be done with Python, which both Mask R-CNN and OpenCV are primarly based in.

Camera image recognition with small sample set

I need to visually recognise some flat pictures showed to camera. There are not many of them (maybe 30) but discrimination may depend on details. The input may be partly obscured or shadowed and is suspect to lighting changes.
The samples need to be updatable.
There are many existing frameworks for object detection, with the most reliable ones depending on deep learning methods (mostly convolutional networks). However, the pretrained models are not well optimised to discern flat imagery of course, and even if I start training from scratch, updating the system for new samples would take a cumbersome training process, if I am right about how this works.
Is it possible to use deep learning while still keeping the sample pool flexible?
Is there any other well known reliable method to detect images from a small sample set?
One can use well trained networks for visual classification like Inception or SqueezeNet, slice of the last layer(s) and add a simple statistical algorithm (for example k-nearest neighbour) that can be directly teached by the samples in a non-iterative fashion.
Most classification-related calculations like lighting and orientation insensitivity are already handled by the pre-trained network then, while the network's output keep enough information to allow statistical algorithms decide the image class.
An implementation using k-nearest neighbour is shown here: https://teachablemachine.withgoogle.com/ , the source is hosted here: https://github.com/googlecreativelab/teachable-machine .
Use transfer learning; you’ll still need to build a training set, but you’ll get better results than starting with random weights. Try to find a model trained on images similar to yours. You might also do some black box testing of the selected model with your curated images to baseline it’s response curve to your images.

Inferring depth from a front facing camera using Deep Reinforcement Learning, ConvNets and RNN's

For a personal project I thought about a Pi Car that drives forward in a loop between the living room and kitchen and is able to steer itself between hallways and avoid collisions.
I was able to create this PoC using Behavioral Cloning. I manually drove the RC car along a black line on the floor while it recorded images.
I then ran the images through a ConvNet and used the model to predict the left and right motor controls. It worked, but left a lot to be desired.
Now I would like to repeat this PoC without manual training. A sonar sensor / LiDAR would work to avoid collision, but I am hoping to learn more about CV
The approach I have in mind is:
1) have the car in continuous forward motion while it records images
2) feed the images into a ConvNet to learn features like how close an object is
3) feed the output of the ConvNet into an RNN
4) the RNN will guide a reinforcement policy
5) the policy is simple: anything blocking your forward motion should be avoided
This is loosely based on this work at Samsung and UC. My thinking is the features learned from the CNN will be used by the RNN in a time series to learn how close an object is.
Think of the car moving closer and closer to the couch. The features of the couch will change; thus hopefully inferring depth. Blocked forward motion would mean objects are getting closer.
One of the issues at now is how can I reflect being blocked by objects in the policy outside of a simulator?
In a ROS simulator it would be easy, since Gazebo gives you x,y coordinates and I can set a rate of change as good forward motion vs. being blocked.
But how can I do this on a physical robot that has no localization?
Also I am not enrolled in any classes and have been following free online material for the last year for all of this. Any critique, feedback and discussion is highly needed!

Generate photos based on over 1M protos processed by ourselves before

We are running a huge team that process child photos for our customers, the team processes over 1M photos per year.
The process includes basic tuning of light, resize, apply some filters to make the skin looks better.
We want to use deep learning to complete the jobs as much as possible. Which means I want to choose one model and train that model using our existing data. And then use the trained model to generate photos by inputing the new unprocessed photos.
Is there existing model that I can make use of, or any papers have covered this scenario?
Any help would be appreciated, thanks!
You could try something like this: https://arxiv.org/pdf/1412.7725.pdf. But with deep learning and your amount of training data you can problem get any big enough model to work well.
Image generation is not what you should search for. Image generation means that an image is generated (almost) completely from nothing. You want to enhance an existing image.
Although I haven't read any papers about this scenario so far, searching for "image enhancement neural network" reveald several promising results:
A Survey on Image Enhancement Techniques: Classical Spatial Filter, Neural Network, Cellular Neural Network, and Fuzzy Filter: http://ieeexplore.ieee.org/document/4237993/
A new class of nonlinear filters for image enhancement: http://ieeexplore.ieee.org/document/150915/
An image enhancement technique combining sharpening and noise reduction: http://ieeexplore.ieee.org/document/1044761/
I guess you could do the following:
Create a CNN model. The only "special" thing of this model is that it does not have a fully connected layer as target, but another (3 channel) image. You have to adjust the error function to this. (Similar to semantic segmentation).

Deep Neural Network combined with qlearning

I'm using joint positions from a Kinect camera as my state space but I think it's going to be too large (25 joints x 30 per second) to just feed into SARSA or Qlearning.
Right now I'm using the Kinect Gesture Builder program which uses Supervised Learning to associate user movement to specific gestures. But that requires supervised training which I'd like to move away from. I figure the algorithm might pick up certain associations between joints that I would when I classify the data myself (hands up, step left, step right, for example).
I think feeding that data into a deep neural network and then pass that into a reinforcement learning algorithm might give me a better result.
There was a paper on this recently. https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
I know Accord.net has both deep neural networks and RL but has anyone combined them together? Any insights?
If I understand correctly from your question + comment, what you want is to have an agent that performs discrete actions using a visual input (raw pixels from a camera). This looks exactly like what DeepMind guys recently did, extending the paper you mentioned. Have a look at this. It is the newer (and better) version of playing Atari games. They also provide an official implementation, which you can download here.
There is even an implementation in Neon which works pretty well.
Finally, if you want to use continuous actions, you might be interested in this very recent paper.
To recap: yes, somebody combined DNN + RL, it works and if you want to use raw camera data to train an agent with RL, this is definitely one way to go :)