Convolution for state representation - deep-learning

When using DQN or other deep RL algorithms, does it make sense to use a convolutional layer in the actor or critic network when you have a state input?
Let's say:
state representation 1: (obj label, position, velocity) of each object in the environment
state representation 2:
There is a tile-based/gridworld style game. We have a 2D grid of numbers describing each object type (1=apple, 2=dog, 3=agent, etc.). We flatten this grid and pass it in as the state to our RL algorithm.
In either case, does it make sense to use a conv layer? Why or why not?

Convolutional layers basically encode the intuition of "location invariance", the idea that we expect detection of certain "features" ("things", edges, corners, circles, noses, faces, whatevers) to work in roughly the same way regardless of "where" (typically in a 2D space, but could theoretically also be in some other kind of space) they are. This intuition is implemented by having "filters" or "feature detectors" that "slide" along some space.
Let's say: state representation 1: (obj label, position, velocity) of each object in the environment
In this case, the intuition described above does not make sense. The input is not some kind of "space" where we expect to be able to detect similar "shapes" in different locations. A convolutional layer likely would perform poorly here.
state representation 2: There is a tile-based/gridworld style game. We have a 2D grid of numbers describing each object type (1=apple, 2=dog, 3=agent, etc.). We flatten this grid and pass it in as the state to our RL algorithm.
With the 2D grid representation, the intuition encoded by convolutional layers may make sense. For example, to detect useful patterns like dogs being adjacent to, or surrounded by, apples. However, in this case you wouldn't want to flatten the grid; just pass the entire 2D grid as input into whatever framework you're using to implement convolutional layers: it might do some flattening internally, but for the whole concept of convolutional layers the original, unflattened dimensions are highly relevant and important. Encoding of categorical variables as numbers 1, 2, 3, etc. also doesn't tend to work well with Neural Networks. A one-hot encoding (with channels for convolutional layers, one channel per object type) would work better. Just like coloured images tend to have multiple 2D grids (typically a 2D grid for Red, another for Green, and another for Blue in the case of RGB images), you'd want one full grid per object type.
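As a rough illustration of the one-channel-per-object-type idea, here is a minimal NumPy sketch. The function name `one_hot_grid`, the 3x3 board, and the id 0 for empty tiles are assumptions made up for this example; they are not from the question.

```python
import numpy as np

def one_hot_grid(grid, num_types):
    """Turn an (H, W) grid of integer object ids into (num_types, H, W) binary channels."""
    grid = np.asarray(grid)
    # Compare every cell against each type id 0..num_types-1 via broadcasting.
    type_ids = np.arange(num_types).reshape(-1, 1, 1)
    return (grid[None, :, :] == type_ids).astype(np.float32)

# Hypothetical 3x3 board: 0 = empty, 1 = apple, 2 = dog, 3 = agent.
grid = [[0, 1, 0],
        [2, 3, 0],
        [0, 0, 1]]
state = one_hot_grid(grid, num_types=4)
print(state.shape)  # (4, 3, 3)
print(state[1])     # the "apple" channel
```

The result is laid out channels-first; some frameworks expect channels last, in which case you would move the first axis to the end before feeding it to the conv layer.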

Related

Embedded feature vector components: position-dependent?

I have a question about the position of vector components of embedded features.
After the embedding layer, we usually flatten the embedded features to generate a 1D-shaped vector.
Assume we do not use CNN or RNN layers to process the embedded features, but go directly to the fully-connected layers to do classification:
I wonder if the position of every vector component would affect the meaning of the whole vector to the machine-learning algorithm (e.g. in Keras)?
Background of my question: In NLP, sentences have different lengths and the positions of words with similar meaning do not always align. After vectorization of n-grams of words with embeddings, I wonder if it is possible to connect the flattened embedded vector directly to the dense layer to do classification without a CNN or RNN layer, which would increase the size of my network and potentially cause overfitting to the training set.

How many neurons does the CNN input layer have?

In all the literature they say the input layer of a convnet is a tensor of shape (width, height, channels). I understand that a fully connected network has an input layer with the same number of neurons as there are pixels in an image (considering a grayscale image). So, my question is how many neurons are in the input layer of a Convolutional Neural Network? The image below seems misleading (or I have understood it wrong). It says there are 3 neurons in the input layer. If so, what do these 3 neurons represent? Are they tensors? From my understanding of CNN shouldn't there be just one neuron of size (height, width, channel)? Please correct me if I am wrong.
It seems that you have misunderstood some of the terminology and are also confused that convolutional layers have 3 dimensions.
EDIT: I should make it clear that the input layer to a CNN is a convolutional layer.
The number of neurons in any layer is decided by the developer. For a fully connected layer, it is usually the case that there is a neuron for each input. So as you mention in your question, for an image, the number of neurons in a fully connected input layer would likely be equal to the number of pixels (unless the developer wanted to downsample at this point or something). This also means that you could create a fully connected input layer that takes all pixels in each channel (width, height, channel). In that case, though, each input is received by an input neuron only once, unlike in convolutional layers.
Convolutional layers work a little differently. Each neuron in a convolutional layer has what we call a local receptive field. This just means that the neuron is not connected to the entire input (this would be called fully connected) but just some section of the input (that must be spatially local). These input neurons provide abstractions of small sections of the input data that when taken together over the whole input we call a feature map.
An important feature of convolutional layers is that they are spatially invariant. This means that they look for the same features across the entire image. After all, you wouldn't want a neural network trained on object recognition to only recognise a bicycle if it is in the bottom left corner of the image! This is achieved by constraining all of the weights across the local receptive fields to be the same. Neurons in a convolutional layer that cover the entire input and look for one feature are called filters. These filters are 2 dimensional (they cover the entire image).
However, having the whole convolutional layer looking for just one feature (such as a corner) would massively limit the capacity of your network. So developers add a number of filters so that the layer can look for a number of features across the whole input. This collection of filters creates a 3 dimensional convolutional layer.
I hope that helped!
EDIT-
Using the example the op gave to clear up loose ends:
OP's Question:
So imagine we have (27 X 27) image. And let's say there are 3 filters each of size (3 X 3). So there are totally 3 X 3 X 3 = 27 parameters (W's). So my question is how are these neurons connected? Each of the filters has to iterate over 27 pixels(neurons). So at a time, 9 input neurons are connected to one filter neuron. And these connections change as the filter iterates over all pixels.
Answer:
First, it is important to note that it is typical (and often important) that the receptive fields overlap. For example, with a stride of 2, the top left neuron (neuron A) has a 3x3 receptive field, and the neuron to its right (neuron B) also has a 3x3 receptive field, whose leftmost 3 connections take the same inputs as the rightmost 3 connections of neuron A.
That being said, it seems that you would like to visualise this, so I will stick to your example where there is no overlap and will assume that we do not want any padding around the image. If there is an image of resolution 27x27, and we want 3 filters (this is our choice), then each filter will have 81 neurons (a 9x9 2D grid of neurons). Each of these neurons would have 9 connections (corresponding to the 3x3 receptive field). Because there are 3 filters, and each has 81 neurons, we would have 243 neurons.
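The arithmetic above can be checked with the usual output-size formula for a convolution. This is a small sketch; the helper name `conv_output_size` is made up here, and a stride of 3 is assumed because that is what "no overlap" means for a 3x3 filter.

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    """Number of filter positions along one axis of a convolution."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# Non-overlapping case from the answer: 27x27 image, 3x3 filters, stride 3, no padding.
side = conv_output_size(27, kernel_size=3, stride=3)
neurons_per_filter = side * side
num_filters = 3
total_neurons = num_filters * neurons_per_filter
shared_weights = num_filters * 3 * 3  # weights are shared across positions; biases not counted

print(side, neurons_per_filter, total_neurons, shared_weights)  # 9 81 243 27
```

Note that the 27 parameters from the question and the 243 neurons are different counts: the weights are shared by all neurons of a filter, so adding positions adds neurons but not parameters.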
I hope that clears things up. It is clear to me that you are confused with your terminology (layer, filter, neuron, parameter etc.). I would recommend that you read some blogs to better understand these things and then focus on CNNs. Good luck :)
First, let's clear up the image. The image doesn't say there are exactly 3 neurons in the input layer; it is only for visualisation purposes. The image is showing the general architecture of the network, representing each layer with an arbitrary number of neurons.
Now, to understand CNNs, it is best to see how they will work on images.
Images are 2D objects, and in a computer are represented as 2D matrices, each cell having an intensity value for the pixel. An image can have multiple channels, for example, the traditional RGB channels for a colored image. So these different channels can be thought of as values for different dimensions of the image (in case of RGB these are color dimensions) for the same locations in the image.
On the other hand, neural layers are single dimensional. They take input from one end, and give output from the other. So how do we process 2D images in 1D neural layers? Here the Convolutional Neural Networks (CNNs) come into play.
One can flatten a 2D image into a single 1D vector by concatenating successive rows in one channel, then successive channels. An image of size (width, height, channel) will become a 1D vector of size (width x height x channel) which will then be fed into the input layer of the CNN. So to answer your question, the input layer of a CNN has as many neurons as there are pixels in the image across all its channels.
I think you have confusion on the basic concept of a neuron:
From my understanding of CNN shouldn't there be just one neuron of size (height, width, channel)?
Think of a neuron as a single computational unit, which can't handle more than one number at a time. So a single neuron can't handle all the pixels of an image at once. A neural layer made up of many neurons is equipped for dealing with a whole image.
Hope this clears up some of your doubts. Please feel free to ask any queries in the comments. :)
Edit:
So imagine we have (27 X 27) image. And let's say there are 3 filters each of size (3 X 3). So there are totally 3 X 3 X 3 = 27 parameters (W's). So my question is how are these neurons connected? Each of the filters has to iterate over 27 pixels(neurons). So at a time, 9 input neurons are connected to one filter neuron. And these connections change as the filter iterates over all pixels.
Is my understanding right? I am just trying to visualize CNNs as neurons with the connections.
A simple way to visualise CNN filters is to imagine them as small windows that you are moving across the image. In your case you have 3 filters of size 3x3.
We generally use multiple filters so as to learn different kinds of features from the same local receptive field (as michael_question_answerer aptly puts it) or, in simpler terms, our window. Each filter's weights are randomly initialised, so each filter learns a slightly different feature.
Now imagine each filter moving across the image, covering only a 3x3 grid at a time. We define a stride value which specifies how much the window shifts to the right, and how much down. At each position, the filter weights and image pixels at the window will give a single new value in the new volume created. So to answer your question, at an instance a total of 3x3=9 pixels are connected with the 9 neurons corresponding to one filter. The same for the other 2 filters.
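To make the sliding-window picture concrete, here is a plain NumPy sketch of one filter moving over an image with a given stride and no padding. The function name `conv2d_single` and the random image and filters are placeholders for illustration; a real framework does the same thing far more efficiently.

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    """Slide one kernel over a 2D image (no padding) and return the feature map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The window is the local receptive field of output neuron (i, j).
            window = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            # The same kernel weights are reused at every position.
            out[i, j] = np.sum(window * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((27, 27))              # stand-in for the 27x27 image
filters = rng.standard_normal((3, 3, 3))  # 3 randomly initialised 3x3 filters
feature_maps = np.stack([conv2d_single(image, f, stride=1) for f in filters])
print(feature_maps.shape)  # (3, 25, 25)
```

With `stride=3` the windows no longer overlap and each feature map shrinks to 9x9.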
Your approach to understanding CNNs by visualisation is correct. But you still need to brush up your basic understanding of the terminology. Here are a couple of nice resources that should help:
http://cs231n.github.io/convolutional-networks/
https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
Hope this helps. Keep up the curiosity :)

CNN attention/activation maps

What are common techniques for finding which parts of images contribute most to image classification via convolutional neural nets?
In general, suppose we have 2D matrices with float values between 0 and 1 as entries. Each matrix is associated with a label (single-label, multi-class) and the goal is to perform classification via (Keras) 2D CNNs.
I'm trying to find methods to extract relevant subsequences of rows/columns that contribute most to classification.
Two examples:
https://github.com/jacobgil/keras-cam
https://github.com/tdeboissiere/VGG16CAM-keras
Other examples/resources with an eye toward Keras would be much appreciated.
Note my datasets are not actual images, so using methods with ImageDataGenerator might not directly apply in this case.
There are many visualization methods. Each of these methods has its strengths and weaknesses.
However, you have to keep in mind that the methods partly visualize different things. Here is a short overview based on this paper.
You can distinguish between three main visualization groups:
Functions (gradients, saliency map): These methods visualize how a change in input space affects the prediction
Signal (deconvolution, Guided BackProp, PatternNet): the signal (reason for a neuron's activation) is visualized. So this visualizes what pattern caused the activation of a particular neuron.
Attribution (LRP, Deep Taylor Decomposition, PatternAttribution): these methods visualize how much a single pixel contributed to the prediction. As a result you get a heatmap highlighting which pixels of the input image most strongly contributed to the classification.
Since you are asking how much a pixel has contributed to the classification, you should use attribution methods. Nevertheless, the other methods have their uses too.
One nice toolbox for visualizing heatmaps is iNNvestigate.
This toolbox contains the following methods:
SmoothGrad
DeConvNet
Guided BackProp
PatternNet
PatternAttribution
Occlusion
Input times Gradient
Integrated Gradients
Deep Taylor
LRP
DeepLift
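Of the methods in that list, occlusion is probably the easiest to write by hand, and it works on any 2D float matrix, not just real images. The sketch below is not the iNNvestigate API; `occlusion_map` and `toy_predict` are names invented for this example, and `toy_predict` is only a stand-in for a trained model's prediction function.

```python
import numpy as np

def occlusion_map(predict, x, target_class, patch=2, fill=0.0):
    """Score drop for target_class when each patch of x is replaced by `fill`."""
    base = predict(x)[target_class]
    heat = np.zeros_like(x, dtype=float)
    for i in range(0, x.shape[0], patch):
        for j in range(0, x.shape[1], patch):
            occluded = x.copy()
            occluded[i:i + patch, j:j + patch] = fill
            # A large drop means this patch mattered for the prediction.
            heat[i:i + patch, j:j + patch] = base - predict(occluded)[target_class]
    return heat

# Toy stand-in for a trained model (assumption): class 1 only looks at the top-left 2x2 corner.
def toy_predict(x):
    score = x[:2, :2].sum()
    return np.array([-score, score])

x = np.ones((4, 4))
heat = occlusion_map(toy_predict, x, target_class=1)
print(heat)
```

With a Keras model you would pass in a small wrapper around `model.predict` that adds the batch and channel axes and returns the class scores for one sample. Rows or columns with consistently high values in `heat` are candidates for the relevant subsequences.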

Image from concept of a convolutional neural network

Suppose I have a CNN trained to classify images of different animals. The output of such a model is a point (the output point) in an n-dimensional space, where n is the number of animal classes the model is trained on. That output is then converted into a one-hot vector of n entries, giving the correct label for the image from the point of view of the CNN. But let's stick with the n-dimensional point, which is the CNN's concept of an input image.
Suppose then that I want to take that point and transform it so that the final output is an image with a fixed width and height (the dimensions should be the same for different input images), and so that this image produces the same output point as the original input image. How do I do that?
I'm basically asking for the methods used (training, mostly) for this kind of task, where an image must be reconstructed from the output point of the CNN. I know the image will never be identical, but I'm looking for images that produce the same (or at least a not too different) output point as the input image when fed to the CNN. Keep in mind that the input of the model I'm asking for is n-dimensional and the output is a two-dimensional tensor (or three-dimensional if it's not grayscale). I noticed that deepdream does exactly this kind of thing (I think), but every time I put "deepdream" and "generate" in Google, an online generator is almost always shown, not the actual techniques; so if there are some answers to this I'd love to hear about them.
The output label does not contain enough information to reconstruct an entire image.
Quoting from the DeepDream example ipython notebook:
Making the "dream" images is very simple. Essentially it is just a gradient ascent process that tries to maximize the L2 norm of activations of a particular DNN layer.
So the algorithm modifies an existing image such that outputs of certain nodes in the network (can be in an intermediate layer, not necessarily the output nodes) become large. In order to do that, it has to calculate the gradient of the node output with respect to the input pixels.
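To show the shape of that gradient ascent loop without a real network, here is a toy NumPy sketch. Everything in it is an assumption for illustration: the "layer" is a single random linear map `W`, so the gradient with respect to the input can be written by hand; in a real DNN you would get it from the framework's automatic differentiation. The step normalisation is also just one reasonable choice.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))   # toy "layer" weights, not a real network
x = rng.standard_normal(16) * 0.1  # starting "image", flattened

def objective(x):
    """Half the squared L2 norm of the layer's activations."""
    a = W @ x
    return 0.5 * np.dot(a, a)

start = objective(x)
step_size = 0.01
for _ in range(20):
    grad = W.T @ (W @ x)                # d(objective)/dx for a linear layer
    grad /= np.abs(grad).mean() + 1e-8  # normalise so the step size is comparable each iteration
    x = x + step_size * grad            # ascent: move the input, not the weights
end = objective(x)
print(start, end)
```

The key point is that the weights stay fixed and only the input is updated, which is the opposite of ordinary training.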

Store a "routine" which, given some input, generates a 3d model

Well, it's the time of the year where I get busy on my next-generation, cutting edge, R&D project (just for the fun of it...and maybe some profit eventually).
This time, I've had a great idea for a service, which unfortunately I can't detail much.
However, a major part of this project is the ability to generate a 3d model out of certain input criteria. The generated model must be different on each generation.
As such, this is much different than the static models used in games - I think I will have to store actual code more than just model coords.
To give an example of some output:
var apple = new AppleGenerator();
apple->set_size_between(30, 50); // these two numbers are just samples...
apple->set_seeds_between(3, 8); // apple must have at least 3 seeds*
var apple_model = apple->generate();
// * I realize seeds may not be exactly part of the model, but I can't think of anything else
So I need to tackle some points here:
How do I store these models as data?
Do you know of any tools that may help?
I need to incorporate a randomness factor (for example, the apples would have slightly different shapes each time)
I suppose math will play a good part here, but since these are complex shapes, it's going to be infeasible to cook up the necessary formulae for each model, right?
Also, textures must be relevant to each part of the model, as well as making the model look random (eg; I could be detailing a 40 to 60 percent red, and the rest green, for the generated apple).
This is in fact not a simple task. The solution varies a LOT depending on the complexity and variety of the objects you are trying to create.
Let's consider a few cases though:
Object is more or less known:
The simplest case is to have a 3D model made in the conventional way, and then randomize it a bit. Take the apple for example. The randomization can vary from the size of the apple to its texture colors to fruit damage.
All your objects can be described using NURBS surfaces:
In this case, you need to store enough data for the surface to be able to be generated, where of course this data can be randomized a bit.
Your objects have rotational symmetry:
In this case, generating a single curve and rotating it around an axis can give you a shape. An apple is an example. You would need to store only the curve data, and randomizing the shape could be done either on the curve (keeping symmetry) or on the final mesh.
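A minimal sketch of the rotational-symmetry case, assuming NumPy and a made-up apple-like profile. The function name `revolve_profile`, the profile values, and the 10% jitter are all placeholders; only the curve is stored and randomized, and the vertices are derived from it.

```python
import numpy as np

def revolve_profile(radii, heights, segments=16):
    """Rotate a 2D profile curve (radius, height) around the vertical axis.

    Returns an array of shape (len(radii), segments, 3) holding x, y, z vertices.
    """
    radii = np.asarray(radii, dtype=float)
    heights = np.asarray(heights, dtype=float)
    angles = np.linspace(0.0, 2.0 * np.pi, segments, endpoint=False)
    x = radii[:, None] * np.cos(angles)[None, :]
    y = radii[:, None] * np.sin(angles)[None, :]
    z = np.broadcast_to(heights[:, None], x.shape)
    return np.stack([x, y, z], axis=-1)

rng = np.random.default_rng(42)
# Hypothetical apple-like profile: narrow at the bottom and top, wide in the middle.
base_radii = np.array([0.0, 0.6, 1.0, 0.9, 0.3])
heights = np.array([0.0, 0.2, 0.6, 1.0, 1.1])
# Jitter the curve, not the mesh, so each apple differs but stays symmetric.
radii = np.clip(base_radii * (1.0 + 0.1 * rng.standard_normal(base_radii.shape)), 0.0, None)
vertices = revolve_profile(radii, heights)
print(vertices.shape)  # (5, 16, 3)
```

Neighbouring rings of vertices can then be joined into quads or triangles to form the mesh; jittering `vertices` afterwards instead gives asymmetric variation.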
On textures
This is way more complicated than the mesh generation. This is mainly because textures carry much more information than meshes (they are more detailed). You can have many texture generation strategies. In the case of your apple, you could select a few vertices, give them colors (one red, one green, another red etc) and interpolate the other vertex colors. This creates a smooth transition of colors which may look nice on an apple. If you are generating a knife however that just looks terrible.
In most cases, you need to be aware of which part of your mesh represents what, and generate the texture part by part. In the knife example above, you can generate the mesh in two parts, blade and handle, with each part's texture generated separately.
Conclusion
You can have a mixture of these of course. A meshGenerator class can take the data and, based on whichever type it is, generate a mesh accordingly. Perhaps the first solution for object creation is the most suitable, as any complicated object can be more easily defined by its triangles than by NURBS.
Take a look at some of the basic architectural principles used to code Spore, the video game about evolving living creatures: http://chrishecker.com/My_liner_notes_for_spore
Here's an example of how to XML-serialize a mesh, along with some random morph behavior: http://www.ogre3d.org/tikiwiki/Morph+animation#The_XML_format_of_meshes_with_morph_animation
To make your apples all a bit different, you can apply a random transformation (or deformation). See for example: http://wiki.blender.org/index.php/Doc:2.4/Manual/Modifiers/Deform/MeshDeform
You want to use an established file format to avoid strange problems. It's more geometry than pure math. Your generate function would plot the polygons, and then your save method would interact with the formats.
https://stackoverflow.com/questions/441388/most-common-3d-model-format