What does caffe do with the mean-binary file ? - caffe

In the caffe-input layer one can define a mean image that holds mean values of all the images used. From the image net example: "The model requires us to subtract the image mean from each image, so we have to compute the mean".
My question is: What is the implementation of this subtraction? Is it simply :
used_image = original_image - mean_image
or
used_image = mean_image - original_iamge
or
used_image = |original_image - mean_image|^2
if it is one of the first two, then how are negative pixels handeld ? Since the pictures are usually stored in uint8 it would mean that it simply starts from the beginning. e.g
200 - 255 = 56
Why I need to know this? I made tests and I know that the second example or the third example would work better.

It's the first one, a trivial normalization step. Using the second instead wouldn't really matter: the weights would invert.
There are no "negative pixels", per se: this is simply integer input to the matrix operations. You are welcome to interpret this as a visual alteration of some sort, but the arithmetic doesn't care.

Related

Would this be a valid Implementation of an ordinal CrossEntropy?

Would this be a valid implementation of a cross entropy loss that takes the ordinal structure of the GT y into consideration? y_hat is the prediction from a neural network.
ce_loss = F.cross_entropy(y_hat, y, reduction="none")
distance_weight = torch.abs(y_hat.argmax(1) - y) + 1
ordinal_ce_loss = torch.mean(distance_weight * ce_loss)
I'll attempt to answer this question by first fully defining the task, since the question is a bit sparse on details.
I have a set of ordinal classes (e.g. first, second, third, fourth,
etc.) and I would like to predict the class of each data example from
among this set. I would like to define an entropy-based loss-function
for this problem. I would like this loss function to weight the loss
between a predicted class torch.argmax(y_hat) and the true class y
according to the ordinal distance between the two classes. Does the
given loss expression accomplish this?
Short answer: sure, it is "valid". You've roughly implemented L1-norm ordinal class weighting. I'd question whether this is truly the correct weighting strategy for this problem.
For instance, consider that for a true label n, the bin n response is weighted by 1, but the bin n+1 and n-1 responses are weighted by 2. This means that a lot more emphasis will be placed on NOT predicting false positives than on correctly predicting true positives, which may imbue your model with some strange bias.
It also means that examples on the edge will result in a larger total sum of weights, meaning that you'll be weighting examples where the true label is say "first" or "last" more highly than the intermediate classes. (Say you have 5 classes: 1,2,3,4,5. A true label of 1 will require distance_weight of [1,2,3,4,5], the sum of which is 15. A true label of 3 will require distance_weight of [3,2,1,2,3], the sum of which is 11.
In general, classification problems and entropy-based losses are underpinned by the assumption that no set of classes or categories is any more or less related than any other set of classes. In essence, the input data is embedded into an orthogonal feature space where each class represents one vector in the basis. This is quite plainly a bad assumption in your case, meaning that this embedding space is probably not particularly elegant: thus, you have to correct for it with sort of a hack-y weight fix. And in general, this assumption of class non-correlation is probably not true in a great many classification problems (consider e.g. the classic ImageNet classification problem, wherein the class pairs [bus,car], and [bus,zebra] are treated as equally dissimilar. But this is probably a digression into the inherent lack of usefulness of strict ontological structuring of information which is outside the scope of this answer...)
Long Answer: I'd highly suggest moving into a space where the ordinal value you care about is instead expressed in a continuous space. (In the first, second, third example, you might for instance output a continuous value over the range [1,max_place]. This allows you to benefit from loss functions that already capture well the notion that predictions closer in an ordered space are better than predictions farther away in an ordered space (e.g. MSE, Smooth-L1, etc.)
Let's consider one more time the case of the [first,second,third,etc.] ordinal class example, and say that we are trying to predict the places of a set of runners in a race. Consider two races, one in which the first place runner wins by 30% relative to the second place runner, and the second in which the first place runner wins by only 1%. This nuance is entirely discarded by the ordinal discrete classification. In essence, the selection of an ordinal set of classes truncates the amount of information conveyed in the prediction, which means not only that the final prediction is less useful, but also that the loss function encodes this strange truncation and binarization, which is then reflected (perhaps harmfully) in the learned model. This problem could likely be much more elegantly solved by regressing the finishing position, or perhaps instead by regressing the finishing time, of each athlete, and then performing the final ordinal classification into places OUTSIDE of the network training.
In conclusion, you might expect a well-trained ordinal classifier to produce essentially a normal distribution of responses across the class bins, with the distribution peak on the true value: a binned discretization of a space that almost certainly could, and likely should, be treated as a continuous space.

How to use K means clustering to visualise learnt features of a CNN model?

Recently I was going through the paper : "Intriguing Properties of Contrastive Losses"(https://arxiv.org/abs/2011.02803). In the paper(section 3.2) the authors try to determine how well the SimCLR framework has allowed the ResNet50 Model to learn good quality/generalised features that exhibit hierarchical properties. To achieve this, they make use of K-means on intermediate features of the ResNet50 model (intermediate means o/p of block 2,3,4..) & quote the reason -> "If the model learns good representations then regions of similar objects should be grouped together".
Final Results :
KMeans feature visualisation
I am trying to replicate the same procedure but with a different model (like VggNet, Xception), are there any resources explaining how to perform such visualisations ?
The procedure would be as follow:
Let us assume that you want to visualize the 8th layer from VGG. This layer's output might have the shape (64, 64, 256) (I just took some random numbers, this does not correspond to actual VGG). This means that you have 4096 256-dimensional vectors (for one specific image). Now you can apply K-Means on these vectors (for example with 5 clusters) and then color your image corresponding to the clustering result. The coloring is easy, since the 64x64 feature map represents a scaled down version of your image, and thus you just color the corresponding image region for each of these vectors.
I don't know if it might be a good idea to do the K-Means clustering on the combined output of many images, theoretically doing it on many images and one a single one should both give good results (even though for many images you probably would increase the number of clusters to account for the higher variation in your feature vectors).

Indirect Kalman Filter for Inertial Navigation System

I'm trying to implement an Inertial Navigation System using an Indirect Kalman Filter. I've found many publications and thesis on this topic, but not too much code as example. For my implementation I'm using the Master Thesis available at the following link:
https://fenix.tecnico.ulisboa.pt/downloadFile/395137332405/dissertacao.pdf
As reported at page 47, the measured values from inertial sensors equal the true values plus a series of other terms (bias, scale factors, ...).
For my question, let's consider only bias.
So:
Wmeas = Wtrue + BiasW (Gyro meas)
Ameas = Atrue + BiasA. (Accelerometer meas)
Therefore,
when I propagate the Mechanization equations (equations 3-29, 3-37 and 3-41)
I should use the "true" values, or better:
Wmeas - BiasW
Ameas - BiasA
where BiasW and BiasA are the last available estimation of the bias. Right?
Concerning the update phase of the EKF,
if the measurement equation is
dzV = VelGPS_est - VelGPS_meas
the H matrix should have an identity matrix in corrispondence of the velocity error state variables dx(VEL) and 0 elsewhere. Right?
Said that I'm not sure how I have to propagate the state variable after update phase.
The propagation of the state variable should be (in my opinion):
POSk|k = POSk|k-1 + dx(POS);
VELk|k = VELk|k-1 + dx(VEL);
...
But this didn't work. Therefore I've tried:
POSk|k = POSk|k-1 - dx(POS);
VELk|k = VELk|k-1 - dx(VEL);
that didn't work too... I tried both solutions, even if in my opinion the "+" should be used. But since both don't work (I have some other error elsewhere)
I would ask you if you have any suggestions.
You can see a snippet of code at the following link: http://pastebin.com/aGhKh2ck.
Thanks.
The difficulty you're running into is the difference between the theory and the practice. Taking your code from the snippet instead of the symbolic version in the question:
% Apply corrections
Pned = Pned + dx(1:3);
Vned = Vned + dx(4:6);
In theory when you use the Indirect form you are freely integrating the IMU (that process called the Mechanization in that paper) and occasionally running the IKF to update its correction. In theory the unchecked double integration of the accelerometer produces large (or for cheap MEMS IMUs, enormous) error values in Pned and Vned. That, in turn, causes the IKF to produce correspondingly large values of dx(1:6) as time evolves and the unchecked IMU integration runs farther and farther away from the truth. In theory you then sample your position at any time as Pned +/- dx(1:3) (the sign isn't important -- you can set that up either way). The important part here is that you are not modifying Pned from the IKF because both are running independent from each other and you add them together when you need the answer.
In practice you do not want to take the difference between two enourmous double values because you will lose precision (because many of the bits of the significand were needed to represent the enormous part instead of the precision you want). You have grasped that in practice you want to recursively update Pned on each update. However, when you diverge from the theory this way, you have to take the corresponding (and somewhat unobvious) step of zeroing out your correction value from the IKF state vector. In other words, after you do Pned = Pned + dx(1:3) you have "used" the correction, and you need to balance the equation with dx(1:3) = dx(1:3) - dx(1:3) (simplified: dx(1:3) = 0) so that you don't inadvertently integrate the correction over time.
Why does this work? Why doesn't it mess up the rest of the filter? As it turns out, the KF process covariance P does not actually depend on the state x. It depends on the update function and the process noise Q and so on. So the filter doesn't care what the data is. (Now that's a simplification, because often Q and R include rotation terms, and R might vary based on other state variables, etc, but in those cases you are actually using state from outside the filter (the cumulative position and orientation) not the raw correction values, which have no meaning by themselves).

How to detect local maxima and curve windows correctly in semi complex scenarios?

I have a series of data and need to detect peak values in the series within a certain number of readings (window size) and excluding a certain level of background "noise." I also need to capture the starting and stopping points of the appreciable curves (ie, when it starts ticking up and then when it stops ticking down).
The data are high precision floats.
Here's a quick sketch that captures the most common scenarios that I'm up against visually:
One method I attempted was to pass a window of size X along the curve going backwards to detect the peaks. It started off working well, but I missed a lot of conditions initially not anticipated. Another method I started to work out was a growing window that would discover the longer duration curves. Yet another approach used a more calculus based approach that watches for some velocity / gradient aspects. None seemed to hit the sweet spot, probably due to my lack of experience in statistical analysis.
Perhaps I need to use some kind of a statistical analysis package to cover my bases vs writing my own algorithm? Or would there be an efficient method for tackling this directly with SQL with some kind of local max techniques? I'm simply not sure how to approach this efficiently. Each method I try it seems that I keep missing various thresholds, detecting too many peak values or not capturing entire events (reporting a peak datapoint too early in the reading process).
Ultimately this is implemented in Ruby and so if you could advise as to the most efficient and correct way to approach this problem with Ruby that would be appreciated, however I'm open to a language agnostic algorithmic approach as well. Or is there a certain library that would address the various issues I'm up against in this scenario of detecting the maximum peaks?
my idea is simple, after get your windows of interest you will need find all the peaks in this window, you can just compare the last value with the next , after this you will have where the peaks occur and you can decide where are the best peak.
I wrote one simple source in matlab to show my idea!
My example are in wave from audio file :-)
waveFile='Chick_eco.wav';
[y, fs, nbits]=wavread(waveFile);
subplot(2,2,1); plot(y); legend('Original signal');
startIndex=15000;
WindowSize=100;
endIndex=startIndex+WindowSize-1;
frame = y(startIndex:endIndex);
nframe=length(frame)
%find the peaks
peaks = zeros(nframe,1);
k=3;
while(k <= nframe - 1)
y1 = frame(k - 1);
y2 = frame(k);
y3 = frame(k + 1);
if (y2 > 0)
if (y2 > y1 && y2 >= y3)
peaks(k)=frame(k);
end
end
k=k+1;
end
peaks2=peaks;
peaks2(peaks2<=0)=nan;
subplot(2,2,2); plot(frame); legend('Get Window Length = 100');
subplot(2,2,3); plot(peaks); legend('Where are the PEAKS');
subplot(2,2,4); plot(frame); legend('Peaks in the Window');
hold on; plot(peaks2, '*');
for j = 1 : nframe
if (peaks(j) > 0)
fprintf('Local=%i\n', j);
fprintf('Value=%i\n', peaks(j));
end
end
%Where the Local Maxima occur
[maxivalue, maxi]=max(peaks)
you can see all the peaks and where it occurs
Local=37
Value=3.266296e-001
Local=51
Value=4.333496e-002
Local=65
Value=5.049438e-001
Local=80
Value=4.286804e-001
Local=84
Value=3.110046e-001
I'll propose a couple of different ideas. One is to use discrete wavelets, the other is to use the geographer's concept of prominence.
Wavelets: Apply some sort of wavelet decomposition to your data. There are multiple choices, with Daubechies wavelets being the most widely used. You want the low frequency peaks. Zero out the high frequency wavelet elements, reconstruct your data, and look for local extrema.
Prominence: Those noisy peaks and valleys are of key interest to geographers. They want to know exactly which of a mountain's multiple little peaks is tallest, the exact location of the lowest point in the valley. Find the local minima and maxima in your data set. You should have a sequence of min/max/min/max/.../min. (You might want to add an arbitrary end points that are lower than your global minimum.) Consider a min/max/min sequence. Classify each of these triples per the difference between the max and the larger of the two minima. Make a reduced sequence that replaces the smallest of these triples with the smaller of the two minima. Iterate until you get down to a single min/max/min triple. In your example, you want the next layer down, the min/max/min/max/min sequence.
Note: I'm going to describe the algorithmic steps as if each pass were distinct. Obviously, in a specific implementation, you can combine steps where it makes sense for your application. For the purposes of my explanation, it makes the text a little more clear.
I'm going to make some assumptions about your problem:
The windows of interest (the signals that you are looking for) cover a fraction of the entire data space (i.e., it's not one long signal).
The windows have significant scope (i.e., they aren't one pixel wide on your picture).
The windows have a minimum peak of interest (i.e., even if the signal exceeds the background noise, the peak must have an additional signal excess of the background).
The windows will never overlap (i.e., each can be examined as a distinct sub-problem out of context of the rest of the signal).
Given those, you can first look through your data stream for a set of windows of interest. You can do this by making a first pass through the data: moving from left to right, look for noise threshold crossing points. If the signal was below the noise floor and exceeds it on the next sample, that's a candidate starting point for a window (vice versa for the candidate end point).
Now make a pass through your candidate windows: compare the scope and contents of each window with the values defined above. To use your picture as an example, the small peaks on the left of the image barely exceed the noise floor and do so for too short a time. However, the window in the center of the screen clearly has a wide time extent and a significant max value. Keep the windows that meet your minimum criteria, discard those that are trivial.
Now to examine your remaining windows in detail (remember, they can be treated individually). The peak is easy to find: pass through the window and keep the local max. With respect to the leading and trailing edges of the signal, you can see n the picture that you have a window that's slightly larger than the actual point at which the signal exceeds the noise floor. In this case, you can use a finite difference approximation to calculate the first derivative of the signal. You know that the leading edge will be somewhat to the left of the window on the chart: look for a point at which the first derivative exceeds a positive noise floor of its own (the slope turns upwards sharply). Do the same for the trailing edge (which will always be to the right of the window).
Result: a set of time windows, the leading and trailing edges of the signals and the peak that occured in that window.
It looks like the definition of a window is the range of x over which y is above the threshold. So use that to determine the size of the window. Within that, locate the largest value, thus finding the peak.
If that fails, then what additional criteria do you have for defining a region of interest? You may need to nail down your implicit assumptions to more than 'that looks like a peak to me'.

Programmatically obtaining the number of colors used in an image

Question:
Given an image in PNG format, what is the simplest way to programmatically obtain the number of colors used in the image?
Constraints:
The solution will be integreted into a shell script running under Linux, so any solution that fits in such an environment will do.
Please note that the "color capacity of the image file" does not necessarily correspond to "colors used". Example: In an image file with a theoretical color capacity of 256 colors only say 7 colors might be in actual use. I want to obtain the number of colors actually used.
Why write your own program?
If you're doing this with a shell script, you can use the netpbm utilities:
count = `pngtoppm png_file | ppmhist -noheader | wc -l`
The Image.getcolors method in Python Imaging Library seems to do exactly what you want.
Fun. There doesn't appear to be any guaranteed method of doing this; in the worst case you'll need to scan the image and interpret every pixel, in the best possible case the PNG will be using a palette and you can just check there.
Even in the palette case, though, you're not guaranteed that every entry is used -- so you're (at best) getting an upper bound.
http://www.libpng.org/pub/png/spec/1.1/PNG-Contents.html
.. and the chunk info here:
http://www.libpng.org/pub/png/spec/1.1/PNG-Chunks.html
Alnitak's solution is nice :) I really should get to know netpbm and imagemagick etc. better some time.
Just FYI, as a simple and very general solution: loop through each pixel in the image, getting the r,g,b color values as a single integer. Look for that integer in a list. If it's not there, add it. When finished with all the pixels, print the number of colors in the list.
If you want to count occurences, use a hashmap/dictionary instead of a simple list, incrementing the key's value (a counter) if found in the dictionary already. If not found, add it with a starting counter value of 1.