CNN attention/activation maps - deep-learning

What are common techniques for finding which parts of images contribute most to image classification via convolutional neural nets?
In general, suppose we have 2d matrices with float values between 0 and 1 as entires. Each matrix is associated with a label (single-label, multi-class) and the goal is to perform classification via (Keras) 2D CNN's.
I'm trying to find methods to extract relevant subsequences of rows/columns that contribute most to classification.
Two examples:
https://github.com/jacobgil/keras-cam
https://github.com/tdeboissiere/VGG16CAM-keras
Other examples/resources with an eye toward Keras would be much appreciated.
Note my datasets are not actual images, so using methods with ImageDataGenerator might not directly apply in this case.

There are many visualization methods. Each of these methods has its strengths and weaknesses.
However, you have to keep in mind that the methods partly visualize different things. Here is a short overview based on this paper.
You can distinguish between three main visualization groups:
Functions (gradients, saliency map): These methods visualize how a change in input space affects the prediction
Signal (deconvolution, Guided BackProp, PatternNet): the signal (reason for a neuron's activation) is visualized. So this visualizes what pattern caused the activation of a particular neuron.
Attribution (LRP, Deep Taylor Decomposition, PatternAttribution): these methods visualize how much a single pixel contributed to the prediction. As a result you get a heatmap highlighting which pixels of the input image most strongly contributed to the classification.
Since you are asking how much a pixel has contributed to the classification, you should use methods of attribution. Nevertheless, the other methods also have their right to exist.
One nice toolbox for visualizing heatmaps is iNNvestigate.
This toolbox contains the following methods:
SmoothGrad
DeConvNet
Guided BackProp
PatternNet
PatternAttribution
Occlusion
Input times Gradient
Integrated Gradients
Deep Taylor
LRP
DeepLift

Related

Keypoint detection when target appears multiple times

I am implementing a keypoint detection algorithm to recognize biomedical landmarks on images. I only have one type of landmark to detect. But in a single image, 1-10 of these landmarks can be present. I'm wondering what's the best way to organize the ground truth to maximize learning.
I considered creating 10 landmark coordinates per image and associate them with flags that are either 0 (not present) or 1 (present). But this doesn't seem ideal. Since the multiple landmarks in a single picture are actually the same type of biomedical element, the neural network shouldn't be trying to learn them as separate entities.
Any suggestions?
One landmark that can appear everywhere sounds like a typical CNN problem. Your CNN filters should learn which features make up the landmark, but they don't care where it appears. That would be the responsibility of the next layers. Hence, for training the CNN layers you can use a monochrome image as the target: 1 is "landmark at this pixel", 0 if not.
The next layers are basically processing the CNN-detected features. To train those, your ground truth should be basically the desired outcome. Do you just need a binary output (count>0)? A somewhat accurate estimate of the count? Coordinates? Orientation? NN's don't care that much what they learn, so just give it in training what it should produce in inference.

YOLOV1 theory of 2 bounding box predictors in each grid cell and its possible power as psuedo anchor boxes

After YOLO1 there was a trend of using anchor boxes for a while in other iterations as priors (I believe the reason was to both speed up the training and detect different sized objects better)
However YOLOV1 has an interesting mechanism where there are k number of bounding box predictors sliding each grid cell in order to be able to specialize in detecting different scaled objects.
Here is what I wonder, ladies and gentlemen:
Given a very long training time, can these bounding box predictors in YOLOV1 achieve better bounding boxes compared to YOLOV9000 or its counterparts that rely on anchor box mechanism
In my experience, yes they can. I observed two possible optimization paths, one of which is already implemented in latest version of YOLOV3 and V5 by Ultralytics (https://github.com/ultralytics/yolov5)
What I observed was that for a YOLOv3, even before training, using a K means clustering we can ascertain a number of ``common box'' shapes. These data when fed into the network as anchor maskes really improved the performance of the YOLOv3 network for "that particular" dataset since the non-max suppression routine had much better chance of succeeding at filtering out spurious detection for particular classes in each of the detection head. To the best of my knowledge, this technique was implemented in latest iterations of their bounding box regression code.
Suppressing certain layers. In YOLOv3, the network performed detection in three stages with the idea of progressively detecting larger objects to smaller objects. YOLOv3 (and in theory V1) can benefit if with some trial and error, you can ascertain which detection head is your network preferring to use based on the common bounding box shapes that you found in step 1.

Does the number of Instances of an Object in a picture affect the training of a deep-learning object detector

I want to retrain the object detector Yolov4 to recognize figures of the board game Ticket to Ride.
While gathering pictures i was searching for an idea to reduce the amount of needed pictures.
I was wondering if more instances of an object/class in a picture means more "training per picture" which leads to "i need less pictures"
Is this correct? If not could you try to explain in simple terms?
On the roboflow page, they say that the YOLOv4 breaks detecting objects into two pieces:
regression to identify object positioning via bounding boxes;
classification to classify the objects into classes.
Regression (analysis) is - in short - a method of analysis that tries to find the data (images in your case) that is relevant. Classification - on the other hand - transforms the ‘interesting’ images from the previous step into a class (which is ’train piece’, ’tracks’, ’station’ or something else that is worth separating from the rest).
Now, to answer your question: “no, you need more pictures.” When taking more pictures, YOLOv4 is using more samples make / test a more accurate classification. Yet, you have to be careful what you want to classify. You do want the algorithm to extract a ’train’ class from an image, but not an ‘ocean’ class for example. To prevent this, make more (different) pictures of the classes you want to have!

How to decide what layers to use for content and style loss in style transfer?

When you choose a pretrained network for style transfer to determine style and content loss you need to decide which layers best represent the perceived style and content and use them to make said losses. How does one go about choosing such layers?
As far as I know, the way a fully convolutional NN works is that the shallow CNN layers are capturing more simple features such as lines and squares, and the deeper ones are combining them to capture more complex ones(faces, shoes, etc). So for feature caption, you may want to use a deeper layer. When it comes to style, as you might know, it usually uses a combination of different layers.
My two cents.

Store a "routine" which, given some input, generates a 3d model

Well, it's the time of the year were I get busy on my next-generation, cutting edge, R&D project (just for the fun of it...and maybe some profit eventually).
This time, I've had a great idea for a service, which unfortunately I can't detail much.
However, a major part of this project is the ability to generate a 3d model out of certain input criteria. The generated model must be different on each generation.
As such, this is much different than the static models used in games - I think I will have to store actual code more than just model coords.
To give an example of some output:
var apple = new AppleGenerator();
apple->set_size_between(30, 50); // these two numbers are just samples...
apple->set_seeds_between(3, 8); // apple must have at least 3 seeds*
var apple_model = apple->generate();
// * I realize seeds may not be exactly part of the model, but I can't of anything else
So I need to tackle some points here:
How do I store these models as data?
Do you know of any tools that may help?
I need to incorporate a randomness factor (for example, the apples would have slightly different shapes each time)
I suppose math will play a good part here, but since these are complex shapes, it's going to be infeasible to cook up the necessary formulae for each model, right?
Also, textures must be relevant to each part of the model, as well as making the model look random (eg; I could be detailing a 40 to 60 percent red, and the rest green, for the generated apple).
This is in fact not a simple task. The solution varies a LOT depending on the complexity and variety of the objects you are trying to create.
Let's consider a few cases though:
Object is more or less known:
The most simple case is, to have a 3d model in the conventional way, and then randomize it a bit. Take the apple for example. The randomization can vary from the size of the apple to its texture colors to fruit damage.
All your objects can be described using NURBS surfaces:
In this case, you need to store enough data for the surface to be able to be generated, where of course this data can be randomized a bit.
Your objects have rotational symmetry:
In this case, generating a single curve and rotating it around the an axis can give you a shape. An apple is an example. You would need to store only the curve data and randomizing the shape could either be done on the curve (keeping symmetry) or on the final mesh.
On textures
This is way more complicated than the mesh generation. This is mainly because textures carry much more information than meshes (they are more detailed). You can have many texture generation strategies. In the case of your apple, you could select a few vertices, give them colors (one red, one green, another red etc) and interpolate the other vertex colors. This creates a smooth transition of colors which may look nice on an apple. If you are generating a knife however that just looks terrible.
In most cases, you need to be aware of which part of your mesh represents what, and generate the texture part by part. In the knife example above, you can generate the mesh in two steps; blade and handle each part's texture generated separately.
Conclusion
You can have a mixture of these of course. A meshGenerator class can take the data and based on whichever type they are, generates a mesh accordingly. Perhaps the first solution for object creation is the most suitable as any complicated object can be more easily defined by its triangles rather than NURBS.
Take a look at some of the basic architectural principles used to code Spore, the video game about evolving living creatures: http://chrishecker.com/My_liner_notes_for_spore
Here's an example of how to XML-serialize a mesh, along with some random morph behavior: http://www.ogre3d.org/tikiwiki/Morph+animation#The_XML_format_of_meshes_with_morph_animation
To make your apples all a bit different, you can apply a random transformation (or deformation). See for example: http://wiki.blender.org/index.php/Doc:2.4/Manual/Modifiers/Deform/MeshDeform
You want to use an established file format to avoid strange problems. It's more geometry than pure math. Your generate function would plot the polygons, and then your save method would interact with the formats.
https://stackoverflow.com/questions/441388/most-common-3d-model-format