YOLOV1 theory of 2 bounding box predictors in each grid cell and its possible power as psuedo anchor boxes - deep-learning

After YOLO1 there was a trend of using anchor boxes for a while in other iterations as priors (I believe the reason was to both speed up the training and detect different sized objects better)
However YOLOV1 has an interesting mechanism where there are k number of bounding box predictors sliding each grid cell in order to be able to specialize in detecting different scaled objects.
Here is what I wonder, ladies and gentlemen:
Given a very long training time, can these bounding box predictors in YOLOV1 achieve better bounding boxes compared to YOLOV9000 or its counterparts that rely on anchor box mechanism

In my experience, yes they can. I observed two possible optimization paths, one of which is already implemented in latest version of YOLOV3 and V5 by Ultralytics (https://github.com/ultralytics/yolov5)
What I observed was that for a YOLOv3, even before training, using a K means clustering we can ascertain a number of ``common box'' shapes. These data when fed into the network as anchor maskes really improved the performance of the YOLOv3 network for "that particular" dataset since the non-max suppression routine had much better chance of succeeding at filtering out spurious detection for particular classes in each of the detection head. To the best of my knowledge, this technique was implemented in latest iterations of their bounding box regression code.
Suppressing certain layers. In YOLOv3, the network performed detection in three stages with the idea of progressively detecting larger objects to smaller objects. YOLOv3 (and in theory V1) can benefit if with some trial and error, you can ascertain which detection head is your network preferring to use based on the common bounding box shapes that you found in step 1.

Related

Keypoint detection when target appears multiple times

I am implementing a keypoint detection algorithm to recognize biomedical landmarks on images. I only have one type of landmark to detect. But in a single image, 1-10 of these landmarks can be present. I'm wondering what's the best way to organize the ground truth to maximize learning.
I considered creating 10 landmark coordinates per image and associate them with flags that are either 0 (not present) or 1 (present). But this doesn't seem ideal. Since the multiple landmarks in a single picture are actually the same type of biomedical element, the neural network shouldn't be trying to learn them as separate entities.
Any suggestions?
One landmark that can appear everywhere sounds like a typical CNN problem. Your CNN filters should learn which features make up the landmark, but they don't care where it appears. That would be the responsibility of the next layers. Hence, for training the CNN layers you can use a monochrome image as the target: 1 is "landmark at this pixel", 0 if not.
The next layers are basically processing the CNN-detected features. To train those, your ground truth should be basically the desired outcome. Do you just need a binary output (count>0)? A somewhat accurate estimate of the count? Coordinates? Orientation? NN's don't care that much what they learn, so just give it in training what it should produce in inference.

CNN attention/activation maps

What are common techniques for finding which parts of images contribute most to image classification via convolutional neural nets?
In general, suppose we have 2d matrices with float values between 0 and 1 as entires. Each matrix is associated with a label (single-label, multi-class) and the goal is to perform classification via (Keras) 2D CNN's.
I'm trying to find methods to extract relevant subsequences of rows/columns that contribute most to classification.
Two examples:
https://github.com/jacobgil/keras-cam
https://github.com/tdeboissiere/VGG16CAM-keras
Other examples/resources with an eye toward Keras would be much appreciated.
Note my datasets are not actual images, so using methods with ImageDataGenerator might not directly apply in this case.
There are many visualization methods. Each of these methods has its strengths and weaknesses.
However, you have to keep in mind that the methods partly visualize different things. Here is a short overview based on this paper.
You can distinguish between three main visualization groups:
Functions (gradients, saliency map): These methods visualize how a change in input space affects the prediction
Signal (deconvolution, Guided BackProp, PatternNet): the signal (reason for a neuron's activation) is visualized. So this visualizes what pattern caused the activation of a particular neuron.
Attribution (LRP, Deep Taylor Decomposition, PatternAttribution): these methods visualize how much a single pixel contributed to the prediction. As a result you get a heatmap highlighting which pixels of the input image most strongly contributed to the classification.
Since you are asking how much a pixel has contributed to the classification, you should use methods of attribution. Nevertheless, the other methods also have their right to exist.
One nice toolbox for visualizing heatmaps is iNNvestigate.
This toolbox contains the following methods:
SmoothGrad
DeConvNet
Guided BackProp
PatternNet
PatternAttribution
Occlusion
Input times Gradient
Integrated Gradients
Deep Taylor
LRP
DeepLift

Cesium Resampling

I know that Cesium offers several different interpolation methods, including linear (or bilinear in 2D), Hermite, and Lagrange. One can use these methods to resample sets of points and/or create curves that approximate sampled points, etc.
However, the question I have is what method does Cesium use internally when it is rendering a 3D scene and the user is zooming/panning all over the place? This is not a case where the programmer has access to the raster, etc, so one can't just get in the middle of it all and call the interpolation functions directly. Cesium is doing its own thing as quickly as it can in response to user control.
My hunch is that the default is bilinear, but I don't know that nor can I find any documentation that explicitly says what is used. Further, is there a way I can force Cesium to use a specific resampling method during these activities, such as Lagrange resampling? That, in fact, is what I need to do: force Cesium to employ Lagrange resampling during scene rendering. Any suggestions would be appreciated.
EDIT: Here's a more detailed description of the problem…
Suppose I use Cesium to set up a 3-D model of the Earth including a greyscale image chip at its proper location on the model Earth's surface, and then I display the results in a Cesium window. If the view point is far enough from the Earth's surface, then the number of pixels displayed in the image chip part of the window will be fewer than the actual number of pixels that are available in the image chip source. Some downsampling will occur. Likewise, if the user zooms in repeatedly, there will come a point at which there are more pixels displayed across the image chip than the actual number of pixels in the image chip source. Some upsampling will occur. In general, every time Cesium draws a frame that includes a pixel data source there is resampling happening. It could be nearest neighbor (doubt it), linear (probably), cubic, Lagrange, Hermite, or any one of a number of different resampling techniques. At my company, we are using Cesium as part of a large government program which requires the use of Lagrange resampling to ensure image quality. (The NGA has deemed that best for its programs and analyst tools, and they have made it a compliance requirement. So we have no choice.)
So here's the problem: while the user is interacting with the model, for instance zooming in, the drawing process is not in the programmer's control. The resampling is either happening in the Cesium layer itself (hopefully) or in even still lower layers (for instance, the WebGL functions that Cesium may be relying on). So I have no clue which technique is used for this resampling. Worse, if that technique is not Lagrange, then I don't have any clue how to change it.
So the question(s) would be this: is Cesium doing the resampling explicitly? If so, then what technique is it using? If not, then what drawing packages and functions are Cesium relying on to render an image file onto the map? (I can try to dig down and determine what techniques those layers may be using, and/or have available.)
UPDATE: Wow, my original answer was a total misunderstanding of your question, so I've rewritten from scratch.
With the new edits, it's clear your question is about how images are resampled for the screen while rendering. These
images are texturemaps, in WebGL, and the process of getting them to the screen quickly is implemented in hardware,
on the graphics card itself. Software on the CPU is not performant enough to map individual pixels to the screen
one at a time, which is why we have hardware-accelerated 3D cards.
Now for the bad news: This hardware supports nearest neighbor, linear, and mapmapping. That's it. 3D graphics
cards do not use any fancier interpolation, as it needs to be done in a fraction of a second to keep frame rate as high as possible.
Mapmapping is described well by #gman in his article WebGL 3D Textures. It's
a long article but search for the word "mipmap" and skip ahead to his description of that. Basically a single image is reduced
into smaller images prior to rendering, so an appropriately-sized starting point can be chosen at render time. But there will
always be a final mapping to the screen, and as you can see, the choices are NEAREST or LINEAR.
Quoting #gman's article here:
You can choose what WebGL does by setting the texture filtering for each texture. There are 6 modes
NEAREST = choose 1 pixel from the biggest mip
LINEAR = choose 4 pixels from the biggest mip and blend them
NEAREST_MIPMAP_NEAREST = choose the best mip, then pick one pixel from that mip
LINEAR_MIPMAP_NEAREST = choose the best mip, then blend 4 pixels from that mip
NEAREST_MIPMAP_LINEAR = choose the best 2 mips, choose 1 pixel from each, blend them
LINEAR_MIPMAP_LINEAR = choose the best 2 mips. choose 4 pixels from each, blend them
I guess the best news I can give you is that Cesium uses the best of those, LINEAR_MIPMAP_LINEAR to
do its own rendering. If you have a strict requirement for more time-consuming imagery interpolation, that means you
have a requirement to not use a realtime 3D hardware-accelerated graphics card, as there is no way to do Lagrange image interpolation during a realtime render.

Randomly Generate Directed Graph on a grid

I am trying to randomly generate a directed graph for the purpose of making a puzzle game similar to the ice sliding puzzles from pokemon.
This is essentially what I want to be able to randomly generate: http://bulbanews.bulbagarden.net/wiki/Crunching_the_numbers:_Graph_theory
I need to be able to limit the size of the graph in an x and y dimension. In the example in the link, it would be restricted to an 8x4 grid.
The problem I am running in to is not randomly generating the graph, but randomly generating a graph which I can properly map out in a 2d space, since I need something (like a rock) on the opposite side of a node, to make it visually make sense when you stop sliding. The problem with this is sometimes the rock ends up in the path between two other nodes or possibly on another node itself, which causes the entire graph to become broken.
After discussing the problem with a few people I know, we came to a couple of conclusions that may lead to a solution. Including the obstacles in the grid as part of the graph when constructing it. Start out with a fully filled grid and just draw a random path and delete out blocks that will make that path work, though the problem then becomes figuring out which ones to delete so that you don't accidentally introduce an additional, shorter path. We were also thinking a dynamic programming algorithm may be beneficial, though none of us are too skilled with creating dynamic programming algorithms from nothing. Any ideas or references about what this problem is officially called (if it's an official graph problem) would be most helpful.
I wouldn't look at it as a graph problem, since as you say the representation is incomplete. To generate a puzzle I would work directly on a grid, and work backwards; first fix the destination spot, then place rocks in some way to reach it from one or more spots, and iteratively add stones to reach those other spots, with the constraint that you never add a stone which breaks all the paths to the destination.
You might want to generate a planar graph, which means that the edges of the graph will not overlap each other in a two dimensional space. Another definition of planar graphs ist that each planar graph does not have any subgraphs of the type K_3,3 (complete bi-partite with six nodes) or K_5 (complete graph with five nodes).
There's a paper on the fast generation of planar graphs.

Electrically charging edges in a force-based graph drawing algorithm?

I'm attempting to write a short mini-program in Python that plays around with force-based algorithms for graph drawing.
I'm trying to minimize the number of times lines intersect. Wikipedia suggests giving the lines an electrical charge so that they repel each other. I asked my physics teacher how I might simulate this, and she mentioned using calculus with Coulomb's Law, but I'm uncertain how to start.
Could somebody give me a hint on how I could do this? (Or alternatively, another way to tweak a force-based graph drawing algorithm to minimize the number of times the lines cross?) I'm just looking for a hint; no source code please.
In case anybody's interested, my source code and a youtube vid I made about it.
You need to explicitly include a term in your cost function that minimizes the number of edge crossings. For example, for every pair of edges that cross, you incur a fixed penalty or, if the edges are weighted, you incur a penalty that is the product of the two weights.