The NN Input Shape for a game of Snake - deep-learning

I am writing a DQN for the game of Snake. Currently my neural network uses an input shape of
(12,), with the 12 binary values representing: Food Up, Down, Left, Right; Danger Up, Down, Left, Right; Direction Up, Down, Left, Right.
The agent performs well and shows learning. However, I am playing with the idea of not generalising the state space and instead feeding the network the entire 10x10 grid, with each tile containing one of (Head, Tail, Food, Nothing). As input to the neural network, should the input shape be (10,10), with each element containing a number 1-4? If so, how would the network derive information such as which direction the snake is travelling?

Related

Keypoint detection when target appears multiple times

I am implementing a keypoint detection algorithm to recognize biomedical landmarks on images. I only have one type of landmark to detect. But in a single image, 1-10 of these landmarks can be present. I'm wondering what's the best way to organize the ground truth to maximize learning.
I considered creating 10 landmark coordinates per image and associating each with a flag that is either 0 (not present) or 1 (present). But this doesn't seem ideal. Since the multiple landmarks in a single picture are actually the same type of biomedical element, the neural network shouldn't be trying to learn them as separate entities.
Any suggestions?
One landmark that can appear everywhere sounds like a typical CNN problem. Your CNN filters should learn which features make up the landmark, but they don't care where it appears. That would be the responsibility of the next layers. Hence, for training the CNN layers you can use a monochrome image as the target: 1 is "landmark at this pixel", 0 if not.
The next layers basically process the CNN-detected features. To train those, your ground truth should be the desired outcome. Do you just need a binary output (count>0)? A somewhat accurate estimate of the count? Coordinates? Orientation? NNs don't care that much what they learn, so just give the network in training what it should produce in inference.
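To make the "monochrome image as the target" idea concrete, here is a minimal sketch in plain Python. The function name landmark_mask and the (row, col) coordinate convention are assumptions made for illustration, and nested lists stand in for whatever array type your framework uses.

```python
def landmark_mask(height, width, landmarks):
    """Build a binary target image: 1 at each landmark pixel, 0 elsewhere.

    `landmarks` is a list of (row, col) pairs; this convention is an
    assumption for the sketch, not something fixed by the question.
    """
    mask = [[0] * width for _ in range(height)]
    for row, col in landmarks:
        mask[row][col] = 1
    return mask


# Three landmarks of the same type in one 4x5 image.
mask = landmark_mask(4, 5, [(0, 1), (2, 3), (3, 0)])
count = sum(sum(row) for row in mask)
print(count)  # 3
```

The same mask works for any number of landmarks per image, so there is no need for 10 fixed coordinate slots with presence flags. In practice people often blur each point into a small blob, but that is a separate design choice not covered here.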

what is class score map?

I was going through the VGGNet paper and came across the testing phase of VGGNet.
During the testing phase, the test image goes through VGGNet and a class score map is obtained. This class score map is spatially averaged to produce a fixed-size vector.
I have googled "class score map" but couldn't find any relevant results. I would like to know what the role of the class score map is.
Any hint would be greatly helpful. Thanks
When you train an image recognition model, you train it for a specific image size (and resolution), let's say n_dims = [256, 256]. In the prediction phase, you may have images of different sizes (in pixels), e.g. [1024, 1024]. You extract patches (you can resize the image first by lowering the resolution) and slide your model over the image patches. For each patch you obtain a prediction for all classes (more than one object might be present in a patch), and at the end you have to average these predictions somehow over the whole image.
See OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks.
Instead, we explore the entire image by densely running the network at each location and at multiple scales. While the sliding window approach may be computationally prohibitive for certain types of model, it is inherently efficient in the case of ConvNets (see section 3.5). This approach yields significantly more views for voting, which increases robustness while remaining efficient. The result of convolving a ConvNet on an image of arbitrary size is a spatial map of C-dimensional vectors at each scale.
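As a rough illustration of "spatially averaged to produce a fixed-size vector": if the class score map holds one C-dimensional score vector per spatial location, averaging over all locations leaves a single vector of length C, whatever the map's height and width. The sketch below uses plain Python lists and made-up scores for C = 2 classes; the function name spatial_average is hypothetical.

```python
def spatial_average(score_map):
    """Average per-location class scores over all spatial positions.

    score_map[row][col] is a list of C class scores, so the input has
    shape (rows, cols, C) and the output has length C.
    """
    rows, cols = len(score_map), len(score_map[0])
    num_classes = len(score_map[0][0])
    totals = [0.0] * num_classes
    for row in score_map:
        for scores in row:
            for c, score in enumerate(scores):
                totals[c] += score
    return [total / (rows * cols) for total in totals]


# A 2x2 score map with 2 classes (values invented for the example).
score_map = [
    [[0.75, 0.25], [0.5, 0.5]],
    [[0.25, 0.75], [1.0, 0.0]],
]
image_scores = spatial_average(score_map)
print(image_scores)  # [0.625, 0.375]
```

A larger input image would give a larger map (more rows and columns), but the averaged result would still have exactly C entries.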

generalized Dice loss for segmentation for Caffe

I am struggling to implement the generalized Dice loss for Caffe as a Python layer that calculates the loss for sub-volumes. I am hoping to get some help here. Or at least, if there is any existing code, please share the link.
I have 5 labels (0: background, and labels 1:4 for objects). Since I am extracting patches from 3D data, some of the sub-volumes contain only background. How should the Dice loss be calculated for these sub-volumes?
Why, in this line of code for creating the one-hot label, has the author counted the background voxels separately?
Do we calculate the volume overlap for the background voxels too?

Convolution for state representation

When using DQN or other deep RL algorithms, does it make sense to use a convolutional layer in the actor or critic network when you have a state input?
Let's say:
state representation 1: (obj label, position, velocity) of each object in the environment
state representation 2:
There is a tile-based/gridworld style game. We have a 2D grid of numbers describing each object type (1=apple, 2=dog, 3=agent, etc.). We flatten this grid and pass it in as the state to our RL algorithm.
In either case, does it make sense to use a conv layer? Why or why not?
Convolutional layers basically encode the intuition of "location invariance", the idea that we expect detection of certain "features" ("things", edges, corners, circles, noses, faces, whatevers) to work in roughly the same way regardless of "where" (typically in a 2D space, but could theoretically also be in some other kind of space) they are. This intuition is implemented by having "filters" or "feature detectors" that "slide" along some space.
Let's say: state representation 1: (obj label, position, velocity) of each object in the environment
In this case, the intuition described above does not make sense. The input is not some kind of "space" where we expect to be able to detect similar "shapes" in different locations. A convolutional layer likely would perform poorly here.
state representation 2: There is a tile-based/gridworld style game. We have a 2D grid of numbers describing each object type (1=apple, 2=dog, 3=agent, etc.). We flatten this grid and pass it in as the state to our RL algorithm.
With the 2D grid representation, the intuition encoded by convolutional layers may make sense. For example, to detect useful patterns like dogs being adjacent to, or surrounded by, apples. However, in this case you wouldn't want to flatten the grid; just pass the entire 2D grid as input into whatever framework you're using to implement convolutional layers: it might do some flattening internally, but for the whole concept of convolutional layers the original, unflattened dimensions are highly relevant and important. Encoding of categorical variables as numbers 1, 2, 3, etc. also doesn't tend to work well with Neural Networks. A one-hot encoding (with channels for convolutional layers, one channel per object type) would work better. Just like coloured images tend to have multiple 2D grids (typically a 2D grid for Red, another for Green, and another for Blue in the case of RGB images), you'd want one full grid per object type.
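A minimal sketch of the one-channel-per-object-type encoding described above, in plain Python. The codes 1=apple, 2=dog, 3=agent follow the question; using 0 for an empty tile, the function name one_hot_grid, and the channels-first layout (channels, height, width) are assumptions for illustration. Frameworks differ on whether channels come first or last.

```python
# Tile codes: 1-3 follow the question; EMPTY = 0 is an assumption.
EMPTY, APPLE, DOG, AGENT = 0, 1, 2, 3
NUM_TYPES = 4


def one_hot_grid(grid, num_types=NUM_TYPES):
    """Turn a 2D grid of integer codes into num_types binary 2D planes.

    The result is indexed as planes[channel][row][col], i.e. a
    channels-first layout of shape (num_types, height, width).
    """
    return [
        [[1 if cell == t else 0 for cell in row] for row in grid]
        for t in range(num_types)
    ]


grid = [
    [EMPTY, APPLE, EMPTY],
    [EMPTY, DOG, EMPTY],
    [AGENT, EMPTY, EMPTY],
]
planes = one_hot_grid(grid)
shape = (len(planes), len(planes[0]), len(planes[0][0]))
print(shape)          # (4, 3, 3)
print(planes[APPLE])  # [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
```

Each tile is 1 in exactly one plane, so the network never has to interpret "dog = 2" as being numerically between "apple = 1" and "agent = 3".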

How many neurons does the CNN input layer have?

In all the literature they say the input layer of a convnet is a tensor of shape (width, height, channels). I understand that a fully connected network has an input layer with as many neurons as there are pixels in the image (considering a grayscale image). So, my question is: how many neurons are in the input layer of a Convolutional Neural Network? The image below seems misleading (or I have understood it wrong). It says 3 neurons in the input layer. If so, what do these 3 neurons represent? Are they tensors? From my understanding of CNN shouldn't there be just one neuron of size (height, width, channel)? Please correct me if I am wrong.
It seems that you have misunderstood some of the terminology and are also confused that convolutional layers have 3 dimensions.
EDIT: I should make it clear that the input layer to a CNN is a convolutional layer.
The number of neurons in any layer is decided by the developer. For a fully connected layer, there is usually a neuron for each input. So, as you mention in your question, for an image the number of neurons in a fully connected input layer would likely be equal to the number of pixels (unless the developer wanted to downsample at this point or something). This also means that you could create a fully connected input layer that takes all pixels in each channel (width, height, channel). In that case, however, each input is received by only one input neuron, unlike in convolutional layers.
Convolutional layers work a little differently. Each neuron in a convolutional layer has what we call a local receptive field. This just means that the neuron is not connected to the entire input (this would be called fully connected) but just some section of the input (that must be spatially local). These input neurons provide abstractions of small sections of the input data that when taken together over the whole input we call a feature map.
An important feature of convolutional layers is that they are spatially invariant. This means that they look for the same features across the entire image. After all, you wouldn't want a neural network trained on object recognition to only recognise a bicycle if it is in the bottom left corner of the image! This is achieved by constraining all of the weights across the local receptive fields to be the same. Neurons in a convolutional layer that cover the entire input and look for one feature are called filters. These filters are 2 dimensional (they cover the entire image).
However, having the whole convolutional layer looking for just one feature (such as a corner) would massively limit the capacity of your network. So developers add a number of filters so that the layer can look for a number of features across the whole input. This collection of filters creates a 3 dimensional convolutional layer.
I hope that helped!
EDIT-
Using the example the OP gave to clear up loose ends:
OP's Question:
So imagine we have (27 X 27) image. And let's say there are 3 filters each of size (3 X 3). So there are totally 3 X 3 X 3 = 27 parameters (W's). So my question is how are these neurons connected? Each of the filters has to iterate over 27 pixels(neurons). So at a time, 9 input neurons are connected to one filter neuron. And these connections change as the filter iterates over all pixels.
Answer:
First, it is important to note that it is typical (and often important) for the receptive fields to overlap. For example, with a stride of 2, the top-left neuron (neuron A) has a 3x3 receptive field, and the neuron to its right (neuron B) also has a 3x3 receptive field, whose leftmost 3 connections take the same inputs as the rightmost 3 connections of neuron A.
That being said, it seems you would like to visualise this, so I will stick to your example where there is no overlap, and I will assume that we do not want any padding around the image. Suppose the image has a resolution of 27x27 and we want 3 filters (this is our choice). Then each filter will have 81 neurons (a 9x9 2D grid of neurons). Each of these neurons would have 9 connections (corresponding to the 3x3 receptive field). Because there are 3 filters, and each has 81 neurons, we would have 243 neurons.
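The arithmetic in this example can be checked with the usual output-size formula, (input + 2*padding - kernel) // stride + 1. This is just a small sketch; the helper name conv_output_size is made up, and biases are ignored when counting parameters, as in the OP's own count.

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    """Number of filter positions along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


image_side, kernel_side, num_filters = 27, 3, 3

# No overlap: the stride equals the kernel size.
side = conv_output_size(image_side, kernel_side, stride=3)
neurons_per_filter = side * side
total_neurons = num_filters * neurons_per_filter
weights = num_filters * kernel_side * kernel_side  # biases not counted

print(side, neurons_per_filter, total_neurons, weights)  # 9 81 243 27

# With overlapping receptive fields (stride 2), each filter has more neurons.
print(conv_output_size(image_side, kernel_side, stride=2))  # 13
```

Note that the 27 weights are shared: every one of the 81 neurons in a filter reuses the same 9 weights, which is why the parameter count stays small while the neuron count grows with the image.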
I hope that clears things up. It is clear to me that you are confused with your terminology (layer, filter, neuron, parameter etc.). I would recommend that you read some blogs to better understand these things and then focus on CNNs. Good luck :)
First, let's clear up the image. The image doesn't say there are exactly 3 neurons in the input layer; it is only for visualisation purposes. The image shows the general architecture of the network, representing each layer with an arbitrary number of neurons.
Now, to understand CNNs, it is best to see how they will work on images.
Images are 2D objects, and in a computer are represented as 2D matrices, each cell having an intensity value for the pixel. An image can have multiple channels, for example, the traditional RGB channels for a colored image. So these different channels can be thought of as values for different dimensions of the image (in case of RGB these are color dimensions) for the same locations in the image.
On the other hand, neural layers are single dimensional. They take input from one end, and give output from the other. So how do we process 2D images in 1D neural layers? Here the Convolutional Neural Networks (CNNs) come into play.
One can flatten a 2D image into a single 1D vector by concatenating successive rows in one channel, then successive channels. An image of size (width, height, channel) will become a 1D vector of size (width x height x channel) which will then be fed into the input layer of the CNN. So to answer your question, the input layer of a CNN has as many neurons as there are pixels in the image across all its channels.
I think you have confusion on the basic concept of a neuron:
From my understanding of CNN shouldn't there be just one neuron of size (height, width, channel)?
Think of a neuron as a single computational unit, which can't handle more than one number at a time. So a single neuron can't handle all the pixels of an image at once. A neural layer made up of many neurons is equipped to deal with a whole image.
Hope this clears up some of your doubts. Please feel free to ask any queries in the comments. :)
Edit:
So imagine we have (27 X 27) image. And let's say there are 3 filters each of size (3 X 3). So there are totally 3 X 3 X 3 = 27 parameters (W's). So my question is how are these neurons connected? Each of the filters has to iterate over 27 pixels(neurons). So at a time, 9 input neurons are connected to one filter neuron. And these connections change as the filter iterates over all pixels.
Is my understanding right? I am just trying to visualize CNNs as neurons with the connections.
A simple way to visualise CNN filters is to imagine them as small windows that you are moving across the image. In your case you have 3 filters of size 3x3.
We generally use multiple filters so as to learn different kinds of features from the same local receptive field (as michael_question_answerer aptly puts it) or, in simpler terms, our window. Each filter's weights are randomly initialised, so each filter learns a slightly different feature.
Now imagine each filter moving across the image, covering only a 3x3 grid at a time. We define a stride value which specifies how far the window shifts to the right and how far down. At each position, the filter weights and the image pixels in the window give a single new value in the new volume created. So to answer your question, at any one instant a total of 3x3=9 pixels are connected with the 9 weights corresponding to one filter. The same goes for the other 2 filters.
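Here is a minimal pure-Python sketch of that sliding window for a single filter, assuming a square kernel, no padding, and a single-channel image. The function name slide_filter and the toy 6x6 image are invented for the example; real frameworks do the same thing far more efficiently.

```python
def slide_filter(image, kernel, stride=1):
    """Slide a k x k kernel over a 2D image and return the output grid.

    At each window position the k*k pixels are multiplied by the same
    k*k weights and summed into one output value.
    """
    k = len(kernel)
    out_rows = (len(image) - k) // stride + 1
    out_cols = (len(image[0]) - k) // stride + 1
    output = []
    for i in range(out_rows):
        out_row = []
        for j in range(out_cols):
            top, left = i * stride, j * stride
            value = sum(
                kernel[a][b] * image[top + a][left + b]
                for a in range(k)
                for b in range(k)
            )
            out_row.append(value)
        output.append(out_row)
    return output


# Toy 6x6 image whose pixel at (r, c) has the value r * 6 + c.
image = [[r * 6 + c for c in range(6)] for r in range(6)]
# A 3x3 kernel that simply picks out the centre pixel of each window.
kernel = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]

no_overlap = slide_filter(image, kernel, stride=3)
print(no_overlap)  # [[7, 10], [25, 28]]
print(len(slide_filter(image, kernel, stride=1)))  # 4
```

With 3 filters you would call this three times with three different kernels and stack the three output grids, giving the new volume mentioned above.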
Your approach to understanding CNNs by visualisation is correct. But you still need to brush up on your basic understanding of the terminology. Here are a couple of nice resources that should help:
http://cs231n.github.io/convolutional-networks/
https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
Hope this helps. Keep up the curiosity :)