I am getting confused with the meaning of "backbone" in neural networks, especially in the DeepLabv3+ paper. I did some research and found out that backbone could mean
the feature extraction part of a network
DeepLabv3+ uses Xception and ResNet-101 as its backbone. However, I am not familiar with the entire structure of DeepLabv3+. Which part does the backbone refer to, and which parts remain the same?
A generalized description or definition of backbone would also be appreciated.
In my understanding, the "backbone" refers to the feature extracting network which is used within the DeepLab architecture. This feature extractor is used to encode the network's input into a certain feature representation. The DeepLab framework "wraps" functionalities around this feature extractor. By doing so, the feature extractor can be exchanged and a model can be chosen to fit the task at hand in terms of accuracy, efficiency, etc.
In case of DeepLab, the term backbone might refer to models like the ResNet, Xception, MobileNet, etc.
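As a rough illustration of this "wrapping" idea, here is a minimal sketch in plain Python. It is not DeepLab code: the names `small_backbone`, `large_backbone` and `Framework` are made up for this example, and the "features" are trivial statistics rather than learned ones. It only shows that the surrounding framework stays the same while the feature extractor is exchanged.

```python
from typing import Callable, List, Sequence

Image = Sequence[Sequence[float]]   # toy grayscale image: rows of pixel values
Features = List[float]


def small_backbone(image: Image) -> Features:
    """Hypothetical cheap feature extractor: mean intensity only."""
    pixels = [p for row in image for p in row]
    return [sum(pixels) / len(pixels)]


def large_backbone(image: Image) -> Features:
    """Hypothetical richer feature extractor: mean, min and max intensity."""
    pixels = [p for row in image for p in row]
    return [sum(pixels) / len(pixels), min(pixels), max(pixels)]


class Framework:
    """Wraps a fixed task head around an exchangeable backbone."""

    def __init__(self, backbone: Callable[[Image], Features]) -> None:
        self.backbone = backbone

    def predict(self, image: Image) -> str:
        features = self.backbone(image)  # encode the input into features
        # The head only reads the first feature (the mean intensity),
        # so either backbone above can be plugged in unchanged.
        return "bright" if features[0] > 0.5 else "dark"


image = [[0.9, 0.8], [0.7, 1.0]]
fast_model = Framework(small_backbone)       # cheaper backbone
accurate_model = Framework(large_backbone)   # richer backbone, same framework
```

In a real segmentation model the head would be a decoder and the backbones would be networks such as ResNet or MobileNet, but the division of roles is the same.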
TL;DR Backbone is not a universal technical term in deep learning.
(Disclaimer: yes, there may be a specific kind of method, layer, tool etc. that is called "backbone", but there is no "backbone of a neural network" in general.)
If authors use the word "backbone" as they are describing a neural network architecture, they mean
feature extraction (the part of the network that "sees" the input), but this interpretation is not quite universal in the field: for instance, in my opinion, computer vision researchers would use the term to mean feature extraction, whereas natural language processing researchers would not.
in informal language, that this part in question is crucial to the overall method.
Backbone is a term used in DeepLab models/papers to refer to the feature extractor network. These feature extractor networks compute features from the input image, and these features are then upsampled by a simple decoder module of the DeepLab models to generate segmentation masks. The authors of the DeepLab models have reported performance with different feature extractors (backbones) such as MobileNet, ResNet, and Xception.
CNNs are used for extracting features. Several CNNs are available as backbones, for instance AlexNet, VGGNet, and ResNet. These networks are mainly used for object classification tasks and have been evaluated on widely used benchmarks and datasets such as ImageNet. In image classification or image recognition, the classifier classifies a single object in the image, outputs a single category per image, and gives the probability of matching a class. In object detection, by contrast, the model must recognize several objects in a single image and provide the coordinates that locate each of them. This is why detecting objects can be more difficult than classifying images.
source and more info: https://link.springer.com/chapter/10.1007/978-3-030-51935-3_30
So I'm getting more and more into deep learning using CNNs.
I was wondering if there are examples of "chained" (I don't know what the correct term would be) CNNs - what I mean by that is, using e.g. a first CNN to perform a semantic segmentation task while using its output as input for a second CNN which for example performs a classification task.
My questions would be:
What is the correct term for this sequential use of neural networks?
Is there a way to pack multiple networks into one "big" network that can be trained in a single step, instead of training two models and combining them?
Also if anyone could maybe provide a link so I could read about that kind of stuff, I'd really appreciate it.
Thanks a lot in advance!
Sequential use of independent neural networks can have different interpretations:
The first model may be viewed as a feature extractor and the second one is a classifier.
It may be viewed as a special case of stacking (stacked generalization) with a single model on the first level.
It is common practice in deep learning to chain multiple models together and train them jointly. This is usually called end-to-end learning. Please see this answer about it: https://ai.stackexchange.com/questions/16575/what-does-end-to-end-training-mean
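To make the idea of joint training concrete, here is a toy sketch, assuming two chained "models" that are each just a single scalar weight. The function name `train_end_to_end` and the data are invented for this example. The error measured at the output of the second model is propagated back through both models with the chain rule, so both are updated in the same step.

```python
def train_end_to_end(xs, ts, lr=0.005, steps=500):
    """Jointly fit two chained scalar models: y = w2 * (w1 * x)."""
    w1, w2 = 1.0, 1.0
    n = len(xs)
    for _ in range(steps):
        g1 = g2 = 0.0
        for x, t in zip(xs, ts):
            h = w1 * x            # output of the first model
            y = w2 * h            # output of the second model
            err = y - t           # error is only measured at the very end
            g2 += 2 * err * h / n        # gradient w.r.t. the second model
            g1 += 2 * err * w2 * x / n   # chain rule back into the first model
        w1 -= lr * g1
        w2 -= lr * g2
    return w1, w2


xs = [1.0, 2.0, 3.0]
ts = [6.0 * x for x in xs]   # the chain as a whole should learn y = 6x
w1, w2 = train_end_to_end(xs, ts)
```

Neither model is trained on its own target here. Only the product `w1 * w2` is constrained by the data, which is the essential difference from training two models separately and combining them afterwards.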
I have been trying to tackle a problem where I need to track multiple people through multiple camera viewpoints on a real-time basis.
I found a solution, DeepCC (https://github.com/daiwc/DeepCC), built on the DukeMTMC dataset, but unfortunately it has been taken down because of data confidentiality issues. They were using Fast R-CNN for object detection, triplet loss for re-identification and DeepSort for real-time multiple object tracking.
Questions:
1. Can someone share some other resources regarding the same problem?
2. Is there a way to download and still use the DukeMTMC database for multiple tracking problem?
3. Is anyone aware when the official website (http://vision.cs.duke.edu/DukeMTMC/) will be available again?
Please feel free to provide different variations of the question :)
Intel's OpenVINO framework covers every part of this task:
Object detection with a pretrained Faster R-CNN, SSD or YOLO.
Re-identification models.
A complete demo application.
You can also use other models. Or, if you want to run detection on a GPU, take opencv_dnn_cuda for detection and OpenVINO for re-identification.
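The re-identification step itself can be sketched independently of any framework. The following is a minimal illustration, not OpenVINO code: it assumes that some re-identification model has already turned each detected person into an embedding vector, and the function `match_across_cameras`, the threshold and the vectors are all hypothetical. A person seen by one camera is matched to known identities by cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def match_across_cameras(gallery, query, threshold=0.8):
    """Return the id in `gallery` most similar to `query`, or None.

    gallery: {person_id: embedding} collected from earlier detections.
    query:   embedding of a person detected by another camera.
    """
    best_id, best_score = None, threshold
    for person_id, embedding in gallery.items():
        score = cosine_similarity(embedding, query)
        if score >= best_score:
            best_id, best_score = person_id, score
    return best_id


# Made-up embeddings; a real model would output much longer vectors.
gallery = {"person_1": [1.0, 0.0, 0.2], "person_2": [0.0, 1.0, 0.1]}
seen_again = match_across_cameras(gallery, [0.9, 0.1, 0.2])
stranger = match_across_cameras(gallery, [0.0, 0.0, 1.0])
```

A query that matches nobody above the threshold would normally be registered as a new identity in the gallery.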
A good deep learning library that I have used in the past for my work is Mask R-CNN (Mask Region-based Convolutional Neural Network). Although I have only used it on images and not on videos, the same principles apply, and it is easy to make the transition to detecting objects in a video. The implementation uses TensorFlow and Keras. You split your input data, i.e. images of people, into two sets: training and validation.
For training, use third-party software such as VIA (the VGG Image Annotator) to annotate the people in the images. After the annotations have been drawn, export a JSON file containing all of them; it will be used for the training process. Do the same for the validation set, BUT make sure the validation images have not been seen before by the algorithm.
Once you have annotated both groups and generated the JSON files, you can start training. Mask R-CNN makes training easy: all you need to do is run a single command with the appropriate arguments. If you want to train on your GPU instead of your CPU, install Nvidia's CUDA, which works very well with supported GPUs and requires no extra coding after the installation.
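One simple way to guarantee that no validation image is also used for training is to split the file list once, before annotating. This is a small sketch using only the standard library; the helper `split_train_val` and the file names are made up for illustration and are not part of Mask R-CNN.

```python
import random


def split_train_val(image_names, val_fraction=0.2, seed=0):
    """Split image names into disjoint training and validation lists."""
    names = sorted(set(image_names))        # drop duplicates, fix the order
    random.Random(seed).shuffle(names)      # reproducible shuffle
    n_val = max(1, int(len(names) * val_fraction))
    return names[n_val:], names[:n_val]


images = [f"person_{i:03d}.jpg" for i in range(10)]   # hypothetical file names
train_images, val_images = split_train_val(images)
```

Each list is then annotated separately, which gives one JSON file per set.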
During the training stage, you will be generating weights files, which are stored in the .h5 format. Depending on the number of epochs you choose, there will be a weights file generated per epoch. Once the training has finished, you then will just have to reference that weights file anytime you want to detect relevant objects, i.e. in your video feed.
Some important info:
Mask R-CNN is a somewhat older algorithm, but it still works well today. Although some people have updated the implementation to TensorFlow 2.0+, to get the best use out of it, use the following.
Tensorflow-gpu 1.13.2+
Keras 2.0.0+
CUDA 9.0 to 10.0
Honestly, the hardest part for me in the past was not using the algorithm, but finding the right versions of Tensorflow, Keras, and CUDA, that all play well with each other, and don't error out. Although the above-mentioned versions will work, try and see if you can upgrade or downgrade certain libraries to see if you can get better results.
Here is an article about Mask R-CNN with video that I find very useful:
https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/
The GitHub repo can be found below.
https://github.com/matterport/Mask_RCNN
EDIT
You can use this method across multiple cameras: just set up multiple video captures within a computer vision library like OpenCV. I assume this would be done in Python, which both Mask R-CNN and OpenCV are primarily used with.
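A minimal sketch of the multi-camera loop follows. It assumes capture objects whose `read()` method returns an `(ok, frame)` pair, which is how OpenCV's `cv2.VideoCapture.read()` behaves. To keep the example runnable without cameras or OpenCV installed, it uses a stand-in class `FakeCapture`, which is made up for this illustration, as is the helper `read_all_cameras`.

```python
def read_all_cameras(captures):
    """Read one frame from every capture; return {camera_index: frame}."""
    frames = {}
    for index, capture in enumerate(captures):
        ok, frame = capture.read()   # same (ok, frame) shape as cv2.VideoCapture.read()
        if ok:                       # skip cameras that delivered no frame
            frames[index] = frame
    return frames


class FakeCapture:
    """Stand-in for a camera so the sketch runs without hardware."""

    def __init__(self, frames):
        self._frames = list(frames)

    def read(self):
        if self._frames:
            return True, self._frames.pop(0)
        return False, None


cameras = [FakeCapture(["cam0_frame0", "cam0_frame1"]), FakeCapture([])]
frames = read_all_cameras(cameras)
```

With real hardware, the list would instead hold objects such as `cv2.VideoCapture(0)` and `cv2.VideoCapture(1)`, and each returned frame would be passed to the detector.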
I want to train my own custom GloVe representations from many PDF files. How can I do that? Is there any way to use concepts such as POS tagging and dependency parsing? Can you suggest any link for implementing that?
Your question is too broad to give any tight answers, but of course you can do what you describe.
You'd first look into libraries for extracting plain text from PDFs.
Some word2vec projects have trained word-vectors based on word-tokens that have been extended with POS-labels, or dependency-defined contexts, with potential benefits depending on your goals. See for example Levy & Goldberg's paper on dependency-based embeddings:
https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/
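To illustrate the idea of POS-extended tokens, here is a small sketch of the first stage of a GloVe-style pipeline: counting co-occurrences, with pairs that are d tokens apart contributing 1/d. It assumes the text has already been extracted from the PDFs and tagged by some POS tagger. The function `cooccurrence_counts`, the `word_POS` token format and the example sentence are illustrative choices, not part of any particular library.

```python
from collections import defaultdict


def cooccurrence_counts(tagged_sentences, window=2):
    """Count weighted co-occurrences of POS-extended tokens.

    tagged_sentences: list of sentences, each a list of (word, pos) pairs.
    Returns {(token, context_token): weight}, with weight 1/d per pair
    that occurs d positions apart inside the window.
    """
    counts = defaultdict(float)
    for sentence in tagged_sentences:
        # Extend each word with its POS label, e.g. "run_VERB".
        tokens = [f"{word.lower()}_{pos}" for word, pos in sentence]
        for i, center in enumerate(tokens):
            for d in range(1, window + 1):
                if i + d < len(tokens):
                    context = tokens[i + d]
                    counts[(center, context)] += 1.0 / d
                    counts[(context, center)] += 1.0 / d   # symmetric window
    return counts


sentences = [[("Dogs", "NOUN"), ("run", "VERB"), ("fast", "ADV")]]
counts = cooccurrence_counts(sentences)
```

The resulting counts would then be fed to a GloVe trainer. With POS-extended tokens, "run" as a verb and "run" as a noun receive separate vectors.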
I need to visually recognise some flat pictures shown to a camera. There are not many of them (maybe 30), but discrimination may depend on details. The input may be partly obscured or shadowed and is subject to lighting changes.
The samples need to be updatable.
There are many existing frameworks for object detection, with the most reliable ones depending on deep learning methods (mostly convolutional networks). However, the pretrained models are not well optimised to discern flat imagery of course, and even if I start training from scratch, updating the system for new samples would take a cumbersome training process, if I am right about how this works.
Is it possible to use deep learning while still keeping the sample pool flexible?
Is there any other well known reliable method to detect images from a small sample set?
One can use well-trained networks for visual classification like Inception or SqueezeNet, slice off the last layer(s) and add a simple statistical algorithm (for example k-nearest neighbour) that can be taught directly from the samples in a non-iterative fashion.
Most classification-related calculations, like lighting and orientation insensitivity, are then already handled by the pre-trained network, while the network's output keeps enough information to allow statistical algorithms to decide the image class.
An implementation using k-nearest neighbour is shown here: https://teachablemachine.withgoogle.com/ , the source is hosted here: https://github.com/googlecreativelab/teachable-machine .
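The non-iterative part can be sketched in a few lines. This example assumes that a pre-trained network with its last layer(s) removed has already turned each image into a feature vector; the class `NearestNeighbourClassifier`, the labels and the two-dimensional vectors are made up for illustration (real feature vectors have hundreds of dimensions). It needs Python 3.8+ for `math.dist`.

```python
import math
from collections import Counter


class NearestNeighbourClassifier:
    """k-nearest-neighbour classifier over precomputed feature vectors."""

    def __init__(self, k=3):
        self.k = k
        self.samples = []   # list of (feature_vector, label)

    def add_sample(self, features, label):
        # "Training" is just storing the sample, so the pool stays flexible.
        self.samples.append((list(features), label))

    def predict(self, features):
        nearest = sorted(
            self.samples, key=lambda sample: math.dist(sample[0], features)
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


classifier = NearestNeighbourClassifier(k=1)
classifier.add_sample([0.9, 0.1], "picture_a")
classifier.add_sample([0.1, 0.9], "picture_b")
first_guess = classifier.predict([0.8, 0.2])

# Adding a new picture later needs no retraining of the network.
classifier.add_sample([0.5, 0.5], "picture_c")
second_guess = classifier.predict([0.5, 0.6])
```

Removing or replacing a sample is equally cheap, since only the stored list changes.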
Use transfer learning; you'll still need to build a training set, but you'll get better results than starting with random weights. Try to find a model trained on images similar to yours. You might also do some black-box testing of the selected model with your curated images to baseline its response to them.
I understand XBRL presentation networks very well, and I also understand the mechanisms of prohibiting and overriding relationships, but the way to extend a presentation network with a new, custom concept eludes me.
A presentation network defines a hierarchy of locators, where each locator points to a concept. This makes it possible for the same concept to appear multiple times inside the same network (for example the Equity concept in the Changes of Equity statement, IFRS taxonomy).
The question is: How can an extension taxonomy attach a new arc to a particular locator in the base taxonomy? When an extension taxonomy defines a new arc, that new arc points to two new locators in the same extension taxonomy and will therefore not integrate with the network that is defined by the base taxonomy.
The locators are "proxies" to concepts, and a syntactic detail. From a logical perspective, a presentation network is a DAG of report elements (="concepts" in specese). When prohibiting and overriding, what matters in the resolution of the DTS is the concepts, not the locators.
In specese wording, two relationships are equivalent if their (non-exempt) attributes match, and the XML fragments on the from and to sides are identical. The XML fragments of the from and to sides, in a presentation network, are the concepts, not the locators.
An extension taxonomy will thus have its own locators, but pointing to the same concepts if it wants to reuse them.
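A small sketch may help to see why separate locators do not prevent the networks from merging. It models locators as a mapping from a label to a concept, and arcs as pairs of locator labels. All labels and concept names below are illustrative, and the sketch assumes both sets of arcs belong to the same extended link role. Once every arc is resolved to the concepts its locators point at, the base and extension arcs land in the same graph.

```python
def resolve_arcs(locators, arcs):
    """Translate (from_label, to_label) arcs into (parent, child) concept pairs."""
    return {(locators[source], locators[target]) for source, target in arcs}


# Base taxonomy: its own locators and one parent-child arc.
base_locators = {"loc_equity": "ifrs:Equity", "loc_capital": "ifrs:IssuedCapital"}
base_arcs = [("loc_equity", "loc_capital")]

# Extension taxonomy: NEW locators, one of which points at the SAME concept.
ext_locators = {"ext_equity": "ifrs:Equity", "ext_custom": "ext:CustomReserve"}
ext_arcs = [("ext_equity", "ext_custom")]

# After resolution only concepts remain, so both arcs share the Equity node.
network = resolve_arcs(base_locators, base_arcs) | resolve_arcs(ext_locators, ext_arcs)
children_of_equity = sorted(
    child for parent, child in network if parent == "ifrs:Equity"
)
```

The locator labels `loc_equity` and `ext_equity` never appear in the resolved network, which is the sense in which they are only a syntactic detail.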