DenseNet without convolution? - deep-learning

The recent paper Densely Connected Convolutional Networks (https://arxiv.org/abs/1608.06993) shows that the DenseNet deep learning architecture outperforms state-of-the-art ResNet architectures. Are there papers or repositories for similar densely connected architectures but without convolution (RNNs or plain dense layers)?

No.
The simple answer is that convolution itself acts as a form of regularization by exploiting the spatial locality present in most images. This is also the key to building a deeper network, which is crucial for deeper representations.
Another critical reason is that a fully connected layer matching the input size (usually 224*224) would hog most of your GPU memory, so there is little chance today of training a dense network for images of this size that is more than a few layers deep. Maybe with 10x the GPU RAM you could try to pull that off... Convolution is simply more economical.
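To put rough numbers on that memory argument, here is a back-of-the-envelope sketch (the 224*224*3 input and the 3x3/64-channel convolution are just illustrative choices):
inputs = 224 * 224 * 3                         # ~150k input values
# Fully connected layer mapping the input to an output of the same size:
dense_params = inputs * inputs + inputs        # weights + biases
print(f"dense: {dense_params/1e9:.1f}B params, ~{dense_params*4/1e9:.0f} GB at fp32")
# -> dense: 22.7B params, ~91 GB at fp32, for a single layer
# 3x3 convolution producing 64 channels over the same input:
conv_params = 3 * 3 * 3 * 64 + 64              # weights + biases
print(f"conv:  {conv_params} params")          # -> conv: 1792 params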

Related

How is AlexNet 8 layers deep?

I'm trying to understand why, for example, the MATLAB page describes AlexNet as:
AlexNet is a convolutional neural network that is 8 layers deep.
After using analyzeNetwork() to check the architecture, there are clearly 25 layers.
How are 25 layers related to being 8 layers deep? What's the difference between those two values?
I'm sure that I'm missing something, but I don't know what it is.
The MATLAB documentation is probably not clear enough. It may be better to talk about blocks (personally I prefer this word). If you look at the figure:
Many "layers" have a number at the end that indicates the block they belong to.
The term layer is often ambiguous: some people consider a convolution + activation + batch norm to be a single layer, and there is no consensus. In MATLAB's case, the "8 layers deep" counts only the layers that have learnable weights: 5 convolutional layers and 3 fully connected layers.
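For illustration (using torchvision's AlexNet in PyTorch, since I can't run analyzeNetwork() here), counting only the modules that carry learnable weights reproduces the "8 layers deep" figure:
import torch.nn as nn
from torchvision.models import alexnet
model = alexnet()  # untrained AlexNet with the standard topology
weighted = [m for m in model.modules() if isinstance(m, (nn.Conv2d, nn.Linear))]
print(len(weighted))  # 8 -> 5 Conv2d + 3 Linear
leaves = [m for m in model.modules() if not list(m.children())]
print(len(leaves))    # 21 once ReLU, pooling and dropout are counted; MATLAB's 25 also
                      # includes the input, normalization, softmax and output layers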

Why does CNNs usually have a stem?

Most cutting-edge/famous CNN architectures have a stem that does not use the same block as the rest of the network; instead, most architectures use plain Conv2d or pooling in the stem, without special modules/layers such as a shortcut (residual), an inverted residual, a ghost conv, and so on.
Why is this? Are there experiments/theories/papers/intuitions behind this?
Examples of stems:
classic ResNet: Conv2d + MaxPool
bag-of-tricks ResNet-C: 3*Conv2d + MaxPool
Even though 2 Conv2d layers could form exactly the same structure as a classic residual block, as shown in [figure 2], there is no shortcut in the stem.
There are many other architectures with similar stems, such as EfficientNet, MobileNet, GhostNet, SE-Net, and so on.
References:
https://arxiv.org/abs/1812.01187
https://arxiv.org/abs/1512.03385
As far as I know, this is done in order to quickly downsample the input image with strided convolutions of fairly large kernel size (5x5 or 7x7), so that the later layers can do their work at much lower computational cost.
This is because these specialized modules can do no more than plain convolutions; the difference is in the trainability of the resulting architecture. For example, the skip connections in ResNet are meant to bypass some layers while those layers are still so poorly trained that they do not propagate useful information from the input to the output. Once fully trained, however, the skip connections could in theory be completely removed (or folded in), since the information can then propagate through the layers that would otherwise be skipped. But when you are using a backbone that you don't intend to train yourself, it makes no sense to include architectural features aimed at trainability. Instead, you can "compress" the backbone, leaving only relatively fundamental operations, and freeze all the weights. This saves computational cost both when training the head and in the final deployment.
Stem layers work as a compression mechanism over the initial image.
This leads to a fast reduction in the spatial size of the activations, reducing memory and computational costs.
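As a concrete illustration of that reduction, here is a minimal PyTorch sketch of the classic ResNet stem mentioned in the question (7x7 strided conv + max pool); on a 224x224 input it cuts each spatial dimension by 4x before any residual blocks run:
import torch
import torch.nn as nn
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),  # 224 -> 112
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),                  # 112 -> 56
)
x = torch.randn(1, 3, 224, 224)
print(stem(x).shape)  # torch.Size([1, 64, 56, 56]) -> 16x fewer spatial positions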

Is Hardware Accelerated Min/Max Ray Casting Available with Cuda/Optix?

I am wondering if it is possible in CUDA or OptiX to accelerate the computation of the minimum and maximum value along a line/ray cast from one point to another in a 3D volume.
If not, is there any special hardware on Nvidia GPUs that can accelerate this function (particularly on Volta GPUs or Tesla K80s)?
The short answer to the title question is: yes, hardware-accelerated ray casting is available in CUDA & OptiX. The longer answer is that the question has multiple interpretations, so I'll try to outline the different possibilities.
The different axes of your question that I'm seeing are: CUDA vs OptiX, pre-RTX GPUs vs RTX GPUs (e.g., Volta vs Ampere), min ray queries vs max ray queries, and possibly surface representations vs volume representations.
Pre-RTX vs RTX GPUs:
To perhaps state the obvious, a K80 or a GV100 GPU can be used to accelerate ray casting compared to a CPU, due to the highly parallel nature of the GPU. However, these pre-RTX GPUs don't have any hardware that is specifically dedicated to ray casting. There are bits of somewhat special-purpose hardware, not dedicated to ray casting, that you could probably leverage in various ways, but it's up to you to identify and design those kinds of hardware-acceleration hacks.
The RTX GPUs starting with the Turing architecture do have specialized hardware dedicated to ray casting, so they accelerate ray queries even further than the acceleration you get from using just any GPU to parallelize the ray queries.
CUDA vs OptiX:
CUDA can be used for parallel ray tracing on any GPUs, but it does not currently (as I write this) support access to the specialized RTX hardware for ray tracing. When using CUDA, you would be responsible for writing all the code to build an acceleration structure (e.g. BVH) & traverse rays through the acceleration structure, and you would need to write the intersection and shading or hit-processing programs.
OptiX, DirectX, and Vulkan all allow you to access the specialized ray-tracing hardware in RTX GPUs. By using these APIs, you can achieve higher speeds with lower power requirements, and they also require much less effort because intersection and ray traversal through an acceleration structure are provided for you. These APIs also provide other commonly needed features for production-level ray casting, such as instancing, transforms, and motion blur, as well as a single-threaded programming model for processing ray hits & misses.
Min vs Max ray queries:
OptiX has built-in functionality to return the surface intersection closest to the ray origin, i.e. a 'min query'. OptiX does not provide a similar single query for the furthest intersection (which is what I assume you mean by "max"). To find the maximum-distance hit, or the closest hit to a second point on your ray, you would need to traverse through multiple hits and keep track of the one you want.
In CUDA you're on your own for implementing both min and max queries, so you can do whatever you want, as long as you write all the code yourself.
Surfaces vs Volumes:
Your question mentioned a "3D volume", which has multiple meanings, so just to clarify things:
OptiX (+ DirectX + Vulkan) are APIs for ray tracing of surfaces, for example triangle meshes. The RTX specialty hardware is dedicated to accelerating ray tracing of surface-based representations.
If your "3D volume" is referring to a volumetric representation such as voxel data or a tetrahedral mesh, then surface-based ray tracing might not be the fastest or most appropriate way to cast ray queries. In this case, you might want to use "ray marching" techniques in CUDA, or look at volumetric ray casting APIs for GPUs like NanoVDB.

caffe - how to properly train alexnet with only 7 classes

I have a small dataset collected from ImageNet (7 classes, each with 1000 training images). I tried to train it with the AlexNet model, but somehow the accuracy just can't go any higher (about 68% maximum). I removed the conv4 and conv5 layers to prevent the model from overfitting, and also decreased the number of neurons in each layer (conv and fc). Here is my setup.
Did I do anything wrong that keeps the accuracy so low?
I want to sort out a few terms:
(1) A perceptron is an individual cell in a neural net.
(2) In a CNN, we generally focus on the kernel (filter) as a unit; this is the square matrix of perceptrons that forms a pseudo-visual unit.
(3) The only place it usually makes sense to focus on an individual perceptron is in the FC layers. When you talk about removing some of the perceptrons, I think you mean kernels.
The most important part of training a model is to make sure that your model is properly fitted to the problem at hand. AlexNet (and CaffeNet, the BVLC implementation) is fitted to the full ImageNet data set. Alex Krizhevsky and his colleagues spent a lot of research effort in tuning their network to the problem. You are not going to get similar accuracy -- on a severely reduced data set -- by simply removing layers and kernels at random.
I suggested that you start from CONVNET (the CIFAR-10 net) because it's much better tuned to this scale of problem. Most of all, I strongly recommend that you make constant use of your visualization tools, so that you can detect when the various kernel layers begin to learn their patterns, and to see the effects of small changes in the topology.
You need to run some experiments to tune and understand your topology. Record the kernel visualizations at chosen times during the training -- perhaps at intervals of 10% of expected convergence -- and compare the visual acuity as you remove a few kernels, or delete an entire layer, or whatever else you choose.
For instance, I expect that if you do this with your current amputated CaffeNet, you'll find that the severe losses in depth and breadth greatly change the feature recognition it's learning. The current depth of building blocks is not enough to recognize edges, then shapes, then full body parts. However, I could be wrong -- you do have three remaining layers. That's why I asked you to post the visualizations you got, to compare with published AlexNet features.
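In case it helps, here is a minimal pycaffe sketch of the kind of first-layer kernel visualization I keep recommending; the prototxt, snapshot file name and layer name are placeholders you would swap for your own:
import caffe
import numpy as np
import matplotlib.pyplot as plt
net = caffe.Net('deploy.prototxt', 'snapshot_iter_10000.caffemodel', caffe.TEST)  # placeholders
w = net.params['conv1'][0].data                       # shape: (num_filters, channels, kh, kw)
w = (w - w.min()) / (w.max() - w.min())               # rescale to [0, 1] for display
cols = 8
rows = int(np.ceil(len(w) / cols))
for i, k in enumerate(w):
    plt.subplot(rows, cols, i + 1)
    plt.imshow(k.transpose(1, 2, 0))                  # HWC for imshow; assumes 3 input channels
    plt.axis('off')
plt.savefig('conv1_kernels.png')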
edit: CIFAR VISUALIZATION
CIFAR is much better differentiated between classes than is ILSVRC-2012. Thus, the training requires less detail per layer and fewer layers. Training is faster, and the filters are not nearly as interesting to the human eye. This is not a problem with the Gabor (not Garbor) filter; it's just that the model doesn't have to learn so many details.
For instance, for CONVNET to discriminate between a jonquil and a jet, we just need a smudge of yellow inside a smudge of white (the flower). For AlexNet to tell a jonquil from a cymbidium orchid, the network needs to learn about petal count or shape.

Can I train a deep convolutional network without GPUs?

I am thinking of building a convolutional neural network as a tracking-system application. I get the feeling that all deep network applications require the use of GPUs. Is it necessary to use GPUs in a task like mine? What are the minimum PC requirements I should have for my laptop?
It all depends on the size and depth of your CNN. If your CNN has one convolution layer and one fully connected layer, and the input images are 64x64, you will be able to train your network on your laptop in a reasonable time. If you use GoogLeNet with hundreds of layers and train on the entire ImageNet set, then even with a video card it will take you a week, so on a CPU it will never finish training.
For most practical applications, however, it is desirable to have a GPU to train a convolution network. Note that on AWS you can get GPU-enabled instances for a rather reasonable price, especially if you get spot instances, so you don't necessarily need to have a GPU locally.
Last note: most of the frameworks (theano, torch, caffe, mxnet, tensorflow) allow you to execute the same model on CPU and on GPU with minor or no modifications to the code, so you can prototype locally on the CPU with a small set of images, and then when your model works, train it on AWS on a GPU instance.
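To illustrate that portability, here is a small sketch in PyTorch (the modern successor to the Torch listed above); the same code runs on a laptop CPU or on a GPU instance just by switching the device:
import torch
import torch.nn as nn
device = "cuda" if torch.cuda.is_available() else "cpu"    # falls back to CPU on a laptop
model = nn.Sequential(                                      # tiny CNN for 64x64 RGB images
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),
).to(device)
images = torch.randn(8, 3, 64, 64, device=device)           # dummy batch of images
print(model(images).shape, "on", device)                     # torch.Size([8, 10])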