Tackling imbalanced class members in Caffe: weight contribution of each instance to loss value - deep-learning

I have highly imbalanced data. I know that some users suggest using the InfoGainLoss loss function; however, I ran into a few errors when I tried to add this function to the Caffe layers.
I have the following questions and would really appreciate it if someone could guide me:
How can I add this layer to Caffe? Does anyone know of any sources or code for this layer?
I want to apply it to image segmentation, where the proportion of some classes varies. How can I create the H matrix (a stack of weights) for my images? And how can the InfoGainLoss layer read the specific weight matrix (H) related to a specific image?
After adding the cpp and cu versions of the InfoGainLoss layer to Caffe, should I remake Caffe?
I am sorry for asking several questions, but they are all related concerns. I would be thankful for any help and support.
Thanks

1. If you copy from the current infogain_loss_layer.cpp, you can easily adapt it. For the forward pass, change lines 59-66 to something like:
// assuming num = batch size, dim = label size, image_dim = image height * width
Dtype loss = 0;
for (int i = 0; i < num; ++i) {
for(int k = 0; k < image_dim; k++) {
int label = static_cast<int>(bottom_label[i*image_dim+k]);
for (int j = 0; j < dim; ++j) {
Dtype prob = std::max(bottom_data[i *image_dim *dim+ k * dim + j], Dtype(kLOG_THRESHOLD));
loss -= infogain_mat[label * dim + j] * log(prob);
}
}
}
Similarly, for the backward pass you could change lines 95-101 to something like:
for (int i = 0; i < num; ++i) {
for(int k = 0; k < image_dim; k++) {
const int label = static_cast<int>(bottom_label[i*image_dim+k]);
for (int j = 0; j < dim; ++j) {
Dtype prob = std::max(bottom_data[i *image_dim *dim+ k * dim + j], Dtype(kLOG_THRESHOLD));
bottom_diff[i *image_dim *dim+ k * dim + j] = scale * infogain_mat[label * dim + j] / prob;
}
}
}
This is kind of naive; I could not find any obvious way to optimize it. You will also need to change some setup code in reshape.
2. The suggestion in this PR is to set each diagonal entry of H to min_count/|i|, where |i| is the number of samples that have label i, and to set everything else to 0. Also see this. As for loading the weight matrix: H is fixed for all inputs. You can load it from an lmdb file or in other ways.
3. Yes, you will need to rebuild.
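To make the diagonal-weight suggestion in point 2 concrete, here is a minimal Python sketch of how such an H could be built from per-class label counts. The function name build_infogain_diagonal and the example counts are hypothetical and only for illustration; this shows the arithmetic, not how Caffe itself loads the matrix.

```python
def build_infogain_diagonal(label_counts):
    """Return an n x n matrix H with H[i][i] = min_count / count_i, zeros elsewhere.

    label_counts[i] is the number of samples (e.g. pixels) carrying label i.
    Classes with a zero count get weight 0 (an assumption made for this sketch).
    """
    n = len(label_counts)
    min_count = min(c for c in label_counts if c > 0)
    H = [[0.0] * n for _ in range(n)]
    for i, count in enumerate(label_counts):
        if count > 0:
            H[i][i] = min_count / count
    return H


# Hypothetical counts: class 0 dominates, class 2 is rare.
H = build_infogain_diagonal([900, 90, 10])
for row in H:
    print(row)
```

The rarest class ends up with weight 1 and more frequent classes are scaled down proportionally, so each class contributes roughly equally to the total loss.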
Update:
As Shai pointed out, the infogain pull request for this was approved this week, so the current version of Caffe supports pixelwise infogain loss.

Related

CUDA run lead to display drive stop

I wrote a function. When I run it on the CPU I get the right result. Part of the CPU code is:
for(int x = startx; x < endx; x+=SampleStep)
for(int y = starty; y < endy; y+=SampleMin)
{
int idoff = Width;
Then I ported it to the GPU, like this:
int x = threadIdx.x + blockIdx.x * blockDim.x + startx;
int y = threadIdx.y + blockIdx.y * blockDim.y + starty;
int idoff = blockDim.x * gridDim.x;
When I run the code, the screen goes black and then recovers after a little while. At the same time, the system shows a message like: Display driver stopped responding.
The time reported by the CUDA events is 0 ms, and the result is wrong.
for (int k = CircleBegin; k < CircleEnd; k++)
{
bool Isright = (k-ww>=0) && (k+ww<Width);
if (Isright)
{
float AverR = 0;
for (int i = -ww; i <= ww; i++)
{
for (int j = -wh; j <= wh; j++)
{
AverR += ImgR[(k+i)+(y+j)*idoff];
}
}
When I comment out the line AverR += ImgR[(k+i)+(y+j)*idoff]; the code runs without the black screen. I want to know why. Is this related to my display device (an NVIDIA GT 240), or is some access violation happening? How can I solve this problem?
Your screen is turning black because you are hitting the Windows TDR event. For further description of this and possible solutions, see my answer here.
Since you have nested for-loops, and you haven't told us the size of the data set, it's certainly possible that your code is taking too long to execute, if your for-loops are operating over a large enough range.
When you comment out that line of code, the compiler can completely optimize away the loops, and so that section of code would take essentially zero time to run. As a result, your kernel is no longer taking too long and so you don't hit the TDR event.
There's no reason based on any of the above to assume an access violation is occurring. In fact, I would say it is unlikely because an access violation will often cause an unspecified launch failure, which will terminate a running kernel.
So you'll need to investigate some of the ideas I mentioned in the answer I linked above.

OCR: weighted Levenshtein distance

I'm trying to create an optical character recognition system with a dictionary.
In fact I don't have an implemented dictionary yet =)
I've heard that there are simple metrics based on Levenshtein distance which take into account different distances between different symbols. E.g. 'N' and 'H' are very close to each other, so d("THEATRE", "TNEATRE") should be less than d("THEATRE", "TOEATRE"), which is impossible with the basic Levenshtein distance.
Could you help me locate such a metric, please?
This might be what you are looking for: http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance (and kindly some working code is included in the link)
Update:
http://nlp.stanford.edu/IR-book/html/htmledition/edit-distance-1.html
Here is an example (C#) where the weight of the "replace character" operation depends on the distance between character codes:
static double WeightedLevenshtein(string b1, string b2) {
b1 = b1.ToUpper();
b2 = b2.ToUpper();
double[,] matrix = new double[b1.Length + 1, b2.Length + 1];
for (int i = 1; i <= b1.Length; i++) {
matrix[i, 0] = i;
}
for (int i = 1; i <= b2.Length; i++) {
matrix[0, i] = i;
}
for (int i = 1; i <= b1.Length; i++) {
for (int j = 1; j <= b2.Length; j++) {
double distance_replace = matrix[(i - 1), (j - 1)];
if (b1[i - 1] != b2[j - 1]) {
// Cost of replace
distance_replace += Math.Abs((float)(b1[i - 1]) - b2[j - 1]) / ('Z'-'A');
}
// Cost of remove = 1
double distance_remove = matrix[(i - 1), j] + 1;
// Cost of add = 1
double distance_add = matrix[i, (j - 1)] + 1;
matrix[i, j] = Math.Min(distance_replace,
Math.Min(distance_add, distance_remove));
}
}
return matrix[b1.Length, b2.Length] ;
}
You can see how it works here: http://ideone.com/RblFK
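The C# version above derives the replace cost from character codes. An alternative that fits the OCR use case more directly is a lookup table of visually confusable pairs. The following Python sketch illustrates that idea; the table CONFUSABLE and its cost values are made-up assumptions for illustration, not measured data.

```python
# Hypothetical substitution costs for visually similar characters (made-up values).
CONFUSABLE = {
    frozenset("HN"): 0.3,
    frozenset("O0"): 0.1,
}


def substitution_cost(x, y):
    """Cost of replacing x with y: cheap for confusable pairs, 1.0 otherwise."""
    if x == y:
        return 0.0
    return CONFUSABLE.get(frozenset((x, y)), 1.0)


def weighted_levenshtein(a, b):
    """Levenshtein distance with unit insert/delete and table-driven replace cost."""
    dp = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = float(i)
    for j in range(1, len(b) + 1):
        dp[0][j] = float(j)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + substitution_cost(a[i - 1], b[j - 1]),  # replace
                dp[i - 1][j] + 1.0,  # remove
                dp[i][j - 1] + 1.0,  # add
            )
    return dp[len(a)][len(b)]


print(weighted_levenshtein("THEATRE", "TNEATRE"))
print(weighted_levenshtein("THEATRE", "TOEATRE"))
```

With this table, "TNEATRE" comes out closer to "THEATRE" than "TOEATRE" does, which is the ordering the question asks for.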
A few years too late, but the following Python package (with which I am NOT affiliated) allows arbitrary weighting of all the Levenshtein edit operations, ASCII character mappings, etc.
https://github.com/infoscout/weighted-levenshtein
pip install weighted-levenshtein
Also this one (also not affiliated):
https://github.com/luozhouyang/python-string-similarity
I've recently created a Python package that does exactly that: https://github.com/zas97/ocr_weighted_levenshtein.
In my weighted Levenshtein implementation, the distance between "THEATRE" and "TNEATRE" is 1.3, while the distance between "THEATRE" and "TOEATRE" is 1.42.
Other examples: d("O", "0") is 0.06 and d("e", "c") is 0.57.
These distances were calculated by running multiple OCR engines on a synthetic dataset and gathering statistics on the most common OCR errors. I hope it helps someone =)

cuda kernel for conway's game of life

I'm trying to calculate the number of transitions that would be made in a run of Conway's GOL for a pxq matrix over n iterations. For instance, given 1 iteration with the initial state being 1 blinker (as below), there would be 5 transitions (2 births, 1 survival, 2 deaths from underpopulation). I've already got this working, but I'd like to convert this logic to run using CUDA. Below is what I want to port to CUDA.
code:
static void gol() // call this iterations x's
{
int[] tempGrid = new int[rows * cols]; // grid holds init conditions
for (int i = 0; i < rows; i++)
{
for (int j = 0; j < cols; j++)
{
tempGrid[i * cols + j] = grid[i * cols + j];
}
}
for (int i = 0; i < rows; i++)
{
for (int j = 0; j < cols; j++)
{
int numNeighbors = neighbors(i, j); // finds # of neighbors
if (grid[i * cols + j] == 1 && numNeighbors > 3)
{
tempGrid[i * cols + j] = 0;
overcrowding++;
}
else if (grid[i * cols + j] == 1 && numNeighbors < 2)
{
tempGrid[i * cols + j] = 0;
underpopulation++;
}
else if (grid[i * cols + j] == 1 && numNeighbors > 1)
{
tempGrid[i * cols + j] = 1;
survival++;
}
else if (grid[i * cols + j] == 0 && numNeighbors == 3)
{
tempGrid[i * cols + j] = 1;
birth++;
}
}
}
grid = tempGrid;
}
Your main slowdown is going to be main memory access. So I'd suggest that you pick a largish thread block size based on the hardware you have available. 256 (16x16) is a good choice for cross-hardware compatibility. Each of those thread blocks is going to calculate the results for a slightly smaller section of the board -- if you used 16x16, they'll calculate the results for a 14x14 section of the board, since there is a one element border. (The reason to use a 16x16 block to calculate a 14x14 chunk rather than a 16x16 chunk is for memory read coalescing.)
Divide the board up into (say) 14x14 chunks; that is your grid (organized however you see fit, but most likely something like board_width / 14 by board_height / 14).
Within the kernels, have each thread load its element into shared memory. Then syncthreads. Then have the middle 14x14 elements calculate the new value (using the values stored in shared memory) and write it back into global memory. The use of shared memory helps minimize global reads and writes. This is also the reason to have your thread block size as big as possible -- the edges and corners are "wasted" global memory accesses, since the values fetched there only get used 1 or 3 times, not 9 times.
Here's one way you could proceed:
Each thread makes the computation for 1 element of the grid
Each thread first loads up one element from the main grid into shared memory
Threads on the edge of the thread block need also to load up boundary elements
Each thread can then make their survival computation based on the contents of shared memory
Each thread then writes their result back to main memory
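The steps above can be sketched on the CPU before writing the kernel. The Python below emulates the tiling only: each 16x16 "thread block" loads a tile with a one-cell border into a local buffer (standing in for shared memory), and only the inner 14x14 cells compute and write results. The function name gol_step_tiled is hypothetical, and treating cells outside the board as dead is an assumption, since the question's neighbors() function is not shown.

```python
TILE = 16          # threads per block side
INNER = TILE - 2   # cells per block side that actually produce output


def gol_step_tiled(grid, rows, cols):
    """One Game of Life step over a flat row-major grid, processed tile by tile."""
    new_grid = list(grid)
    counts = {"birth": 0, "survival": 0, "overcrowding": 0, "underpopulation": 0}
    for by in range(0, rows, INNER):
        for bx in range(0, cols, INNER):
            # "Shared memory" load: 16x16 tile including the border.
            tile = [[grid[r * cols + c] if 0 <= r < rows and 0 <= c < cols else 0
                     for c in range(bx - 1, bx - 1 + TILE)]
                    for r in range(by - 1, by - 1 + TILE)]
            # Only the inner 14x14 "threads" compute a result.
            for ty in range(1, TILE - 1):
                for tx in range(1, TILE - 1):
                    r, c = by + ty - 1, bx + tx - 1
                    if r >= rows or c >= cols:
                        continue
                    n = sum(tile[ty + dy][tx + dx]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1)) - tile[ty][tx]
                    alive = tile[ty][tx] == 1
                    if alive and n > 3:
                        new_grid[r * cols + c] = 0
                        counts["overcrowding"] += 1
                    elif alive and n < 2:
                        new_grid[r * cols + c] = 0
                        counts["underpopulation"] += 1
                    elif alive:
                        counts["survival"] += 1
                    elif n == 3:
                        new_grid[r * cols + c] = 1
                        counts["birth"] += 1
    return new_grid, counts


# A horizontal blinker on a 5x5 board.
board = [0] * 25
for col in (1, 2, 3):
    board[2 * 5 + col] = 1
print(gol_step_tiled(board, 5, 5)[1])
```

In a real kernel the counters would need atomic increments or a per-block reduction, since many threads update them at once.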

Cumulative array summation using OpenCL

I'm calculating the Euclidean distance between n-dimensional points using OpenCL. I get two lists of n-dimensional points and I should return an array that contains just the distances from every point in the first table to every point in the second table.
My approach is to do the regular double loop (for every point in Table1 { for every point in Table2 {...} }) and then do the calculation for every pair of points in parallel.
The Euclidean distance is then split into 4 steps:
1. take the difference between each dimension in the points
2. square that difference (still for every dimension)
3. sum all the values obtained in 2.
4. Take the square root of the value obtained in 3. (this step has been omitted in this example.)
Everything works like a charm until I try to accumulate the sum of all differences (namely, executing step 3. of the procedure described above, line 49 of the code below).
As test data I'm using DescriptorLists with 2 points each:
DescriptorList1: 001,002,003,...,127,128; (p1)
129,130,131,...,255,256; (p2)
DescriptorList2: 000,001,002,...,126,127; (p1)
128,129,130,...,254,255; (p2)
So the resulting vector should have the values: 128, 2064512, 2130048, 128
Right now I'm getting random numbers that vary with every run.
I appreciate any help or leads on what I'm doing wrong. Hopefully everything is clear about the scenario I'm working in.
#define BLOCK_SIZE 128
typedef struct
{
//How large each point is
int length;
//How many points in every list
int num_elements;
//Pointer to the elements of the descriptor (stored as a raw array)
__global float *elements;
} DescriptorList;
__kernel void CompareDescriptors_deb(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float As[BLOCK_SIZE])
{
int gpidA = get_global_id(0);
int featA = get_local_id(0);
//temporary array to store the difference between each dimension of 2 points
float dif_acum[BLOCK_SIZE];
//counter to track the iterations of the inner loop
int loop = 0;
//loop over all descriptors in A
for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
//take the i-th descriptor. Returns a DescriptorList with just the i-th
//descriptor in DescriptorList A
DescriptorList tmpA = GetDescriptor(A, i);
//copy the current descriptor to local memory.
//returns one element of the only descriptor in DescriptorList tmpA
//and index featA
As[featA] = GetElement(tmpA, 0, featA);
//wait for all the threads to finish copying before continuing
barrier(CLK_LOCAL_MEM_FENCE);
//loop over all the descriptors in B
for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
//take the difference of both current points
dif_acum[featA] = As[featA]-B.elements[k*BLOCK_SIZE + featA];
//wait again
barrier(CLK_LOCAL_MEM_FENCE);
//square value of the difference in dif_acum and store in C
//which is where the results should be stored at the end.
C[loop] = 0;
C[loop] += dif_acum[featA]*dif_acum[featA];
loop += 1;
barrier(CLK_LOCAL_MEM_FENCE);
}
}
}
Your problem lies in these lines of code:
C[loop] = 0;
C[loop] += dif_acum[featA]*dif_acum[featA];
All threads in your workgroup (well, actually all your threads, but let's come to that later) are trying to modify this array position concurrently without any synchronization whatsoever. Several factors make this really problematic:
The workgroup is not guaranteed to execute completely in parallel, meaning that for some threads C[loop] = 0 can be executed after other threads have already executed the next line.
Those that execute in parallel all read the same value from C[loop], modify it with their increment and try to write back to the same address. I'm not completely sure what the result of that write-back is (I think one of the threads succeeds in writing back while the others fail), but it's wrong either way.
Now let's fix this:
While we might be able to get this to work on global memory using atomics, it won't be fast, so let's accumulate in local memory:
local float* accum;
...
accum[featA] = dif_acum[featA]*dif_acum[featA];
barrier(CLK_LOCAL_MEM_FENCE);
for(unsigned int i = 1; i < BLOCK_SIZE; i *= 2)
{
if ((featA % (2*i)) == 0)
accum[featA] += accum[featA + i];
barrier(CLK_LOCAL_MEM_FENCE);
}
if(featA == 0)
C[loop] = accum[0];
Of course you can reuse other local buffers for this, but I think the point is clear. (By the way, are you sure that dif_acum will be created in local memory? I think I read somewhere that such an array wouldn't be put in local memory, which would make preloading A into local memory kind of pointless.)
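To check the reduction logic independently of OpenCL, here is a small Python simulation of the same tree reduction, run against the test data from the question. Each pass of the while loop corresponds to one barrier-separated step in the kernel. The names tree_reduce and squared_distances are hypothetical, and the sketch assumes the buffer length is a power of two (as BLOCK_SIZE = 128 is).

```python
def tree_reduce(values):
    """Sum a power-of-two-length list the way the work-group reduction does."""
    accum = list(values)
    n = len(accum)
    stride = 1
    while stride < n:
        # One "parallel" step: work-items whose id is a multiple of 2*stride
        # add in the partial sum stored stride positions to the right.
        for lid in range(n):
            if lid % (2 * stride) == 0:
                accum[lid] += accum[lid + stride]
        stride *= 2  # a barrier would sit here in the kernel
    return accum[0]


def squared_distances(list_a, list_b):
    """Squared Euclidean distance from every point in list_a to every point in list_b."""
    return [tree_reduce([(x - y) ** 2 for x, y in zip(a, b)])
            for a in list_a for b in list_b]


# Test data from the question: two 128-dimensional points per list.
descriptors1 = [list(range(1, 129)), list(range(129, 257))]
descriptors2 = [list(range(0, 128)), list(range(128, 256))]
print(squared_distances(descriptors1, descriptors2))
# → [128, 2064512, 2130048, 128]
```

Within one step, the elements being written (multiples of 2*stride) are never the ones being read (offset by stride), which is why a single barrier per step is enough.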
Some other points about this code:
Your code seems to be geared to using only one workgroup (you aren't using either the group id or the global id to decide which items to work on); for optimal performance you might want to use more than that.
It might be personal preference, but to me it seems better to use get_local_size(0) for the workgroup size than a define (since you might change it in the host code without realizing you should have changed your OpenCL code too).
The barriers in your code are all unnecessary, since no thread accesses an element in local memory that was written by another thread. Therefore you don't need to use local memory for this.
Considering the last bullet you could simply do:
float As = GetElement(tmpA, 0, featA);
...
float dif_acum = As-B.elements[k*BLOCK_SIZE + featA];
This would make the code (not considering the first two bullets):
__kernel void CompareDescriptors_deb(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float accum[BLOCK_SIZE])
{
int gpidA = get_global_id(0);
int featA = get_local_id(0);
int loop = 0;
for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
DescriptorList tmpA = GetDescriptor(A, i);
float As = GetElement(tmpA, 0, featA);
for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
//dif_acum is now a private scalar, so it is not indexed
float dif_acum = As-B.elements[k*BLOCK_SIZE + featA];
accum[featA] = dif_acum*dif_acum;
barrier(CLK_LOCAL_MEM_FENCE);
//tree reduction; s is the stride (named s so it does not shadow the outer i)
for(unsigned int s = 1; s < BLOCK_SIZE; s *= 2)
{
if ((featA % (2*s)) == 0)
accum[featA] += accum[featA + s];
barrier(CLK_LOCAL_MEM_FENCE);
}
if(featA == 0)
C[loop] = accum[0];
barrier(CLK_LOCAL_MEM_FENCE);
loop += 1;
}
}
}
Thanks to Grizzly, I now have a working kernel. Here is what I needed to modify based on Grizzly's answer:
I added an IF statement at the beginning of the routine to discard all threads that won't reference a valid position in the arrays I'm using.
if(featA > BLOCK_SIZE){return;}
When copying the first descriptor to local (shared) memory (i.e. to Bs), the index has to be specified, since the function GetElement returns just one element per call (I skipped that in my question).
Bs[featA] = GetElement(tmpA, 0, featA);
Then, the SCAN loop needed a little tweaking, because the buffer is overwritten after each iteration and one cannot control which thread accesses the data first. That is why I'm 'recycling' the dif_acum buffer to store partial results and thereby prevent inconsistencies throughout that loop.
dif_acum[featA] = accum[featA];
There is also some boundary control in the SCAN loop to reliably determine the terms to be added together.
if (featA >= j && next_addend >= 0 && next_addend < BLOCK_SIZE){
Lastly, I thought it made sense to put the loop variable increment inside the last IF statement so that only one thread modifies it.
if(featA == 0){
C[loop] = accum[BLOCK_SIZE-1];
loop += 1;
}
That's it. I still wonder how I can make use of group_size to eliminate that BLOCK_SIZE definition, and whether there are better policies I can adopt regarding thread usage.
So the final code looks like this:
__kernel void CompareDescriptors(__global float *C, DescriptorList A, DescriptorList B, int elements, __local float accum[BLOCK_SIZE], __local float Bs[BLOCK_SIZE])
{
int gpidA = get_global_id(0);
int featA = get_local_id(0);
//global counter to store final differences
int loop = 0;
//auxiliary buffer to store temporary data
local float dif_acum[BLOCK_SIZE];
//discard the threads that are not going to be used.
if(featA > BLOCK_SIZE){
return;
}
//loop over all descriptors in A
for (int i = 0; i < A.num_elements/BLOCK_SIZE; i++){
//take the gpidA-th descriptor
DescriptorList tmpA = GetDescriptor(A, i);
//copy the current descriptor to local memory
Bs[featA] = GetElement(tmpA, 0, featA);
//loop over all the descriptors in B
for (int k = 0; k < B.num_elements/BLOCK_SIZE; k++){
//take the difference of both current descriptors
dif_acum[featA] = Bs[featA]-B.elements[k*BLOCK_SIZE + featA];
//square the values in dif_acum
accum[featA] = dif_acum[featA]*dif_acum[featA];
barrier(CLK_LOCAL_MEM_FENCE);
//copy the values of accum to keep consistency once the scan procedure starts. Mostly important for the first element. Two buffers are necessary because, with a single buffer, the scan procedure would overwrite values that are still to be read.
dif_acum[featA] = accum[featA];
//Compute the accumulated sum (a.k.a. scan)
for(int j = 1; j < BLOCK_SIZE; j *= 2){
int next_addend = featA-(j/2);
if (featA >= j && next_addend >= 0 && next_addend < BLOCK_SIZE){
dif_acum[featA] = accum[featA] + accum[next_addend];
}
barrier(CLK_LOCAL_MEM_FENCE);
//copy As to accum
accum[featA] = GetElementArray(dif_acum, BLOCK_SIZE, featA);
barrier(CLK_LOCAL_MEM_FENCE);
}
//tell one of the threads to write the result of the scan in the array containing the results.
if(featA == 0){
C[loop] = accum[BLOCK_SIZE-1];
loop += 1;
}
barrier(CLK_LOCAL_MEM_FENCE);
}
}
}

1D multiple peak detection?

I am currently trying to implement basic speech recognition in AS3. I need this to be completely client-side, so I can't access powerful server-side speech recognition tools. The idea I had was to detect the syllables in a word and use that to determine the word spoken. I am aware that this will greatly limit the capacity for recognition, but I only need to recognize a few key words, and I can make sure they all have a different number of syllables.
I am currently able to generate a 1D array of voice level for a spoken word, and when I draw it I can clearly see distinct peaks for the syllables in most cases. However, I am completely stuck as to how I would find those peaks. I only really need the count, but I suppose that comes with finding them. At first I thought of grabbing a few maximum values and comparing them with the average of the values, but I had forgotten that one peak can be bigger than the others, so all my "peaks" were located on one actual peak.
I stumbled onto some Matlab code that looks almost too short to be true, but I can't verify that, as I am unable to convert it to any language I know. I tried AS3 and C#. So I am wondering if you guys could start me on the right path or have any pseudo-code for peak detection?
The matlab code is pretty straightforward. I'll try to translate it to something more pseudocodeish.
It should be easy to translate to ActionScript/C#, you should try this and post follow-up questions with your code if you get stuck, this way you'll have the best learning effect.
Param: delta (defines kind of a tolerance and depends on your data, try out different values)
min = Inf (or some very high value)
max = -Inf (or some very low value)
lookformax = 1
for every datapoint d [0..maxdata] in array arr do
this = arr[d]
if this > max
max = this
maxpos = d
endif
if this < min
min = this
minpos = d
endif
if lookformax == 1
if this < max-delta
there's a maximum at position maxpos
min = this
minpos = d
lookformax = 0
endif
else
if this > min+delta
there's a minimum at position minpos
max = this
maxpos = d
lookformax = 1
endif
endif
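For reference, here is the same pseudocode as a runnable Python sketch. The function name detect_peaks and the sample data are made up for illustration; the logic follows the pseudocode step by step and returns the positions instead of printing them.

```python
import math


def detect_peaks(values, delta):
    """Return (maxima, minima) as lists of positions.

    A maximum is confirmed once the signal drops more than delta below it;
    a minimum is confirmed once the signal rises more than delta above it.
    """
    maxima, minima = [], []
    lo, hi = math.inf, -math.inf
    lo_pos = hi_pos = 0
    look_for_max = True
    for i, v in enumerate(values):
        if v > hi:
            hi, hi_pos = v, i
        if v < lo:
            lo, lo_pos = v, i
        if look_for_max:
            if v < hi - delta:
                maxima.append(hi_pos)
                lo, lo_pos = v, i
                look_for_max = False
        elif v > lo + delta:
            minima.append(lo_pos)
            hi, hi_pos = v, i
            look_for_max = True
    return maxima, minima


maxima, minima = detect_peaks([0, 2, 0, 3, 0], delta=1)
print(maxima, minima)  # → [1, 3] [2]
```

For the syllable-counting use case, len(maxima) would be the number of detected peaks; delta has to be tuned to the noise level of the voice data.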
Finding the peaks and valleys of a curve is all about looking at the slope of the line: at such a location the slope is 0. Since I am guessing a voice curve is very irregular, it must first be smoothed until only significant peaks exist.
So as I see it, the curve should be taken as a set of points. Groups of points should be averaged to produce a simple smooth curve. Then the differences between consecutive points should be compared; where consecutive points are not very different from each other, that area is identified as a peak, valley, or plateau.
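A rough Python sketch of this smooth-then-look-at-the-slope idea follows. The function names, the window size, and the sample data are assumptions chosen for illustration; a centered moving average stands in for "averaging groups of points", and a peak is taken to be where the slope changes from positive to non-positive.

```python
def moving_average(values, window):
    """Centered moving average; the window shrinks at the ends of the list."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def slope_peaks(values):
    """Positions where the slope goes from rising to flat or falling."""
    return [i for i in range(1, len(values) - 1)
            if values[i] - values[i - 1] > 0 and values[i + 1] - values[i] <= 0]


# Made-up signal: two real peaks (positions 5 and 11) plus a small bump at 1.
noisy = [0, 1, 0, 1, 5, 9, 5, 1, 0, 1, 6, 10, 6, 1, 0]
print(slope_peaks(noisy))                     # raw signal also flags the bump
print(slope_peaks(moving_average(noisy, 3)))  # smoothing leaves the two real peaks
```

A wider window removes more noise but can also merge syllables that are close together, so the window size needs tuning on real recordings.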
If anyone wants the final code in AS3, here it is:
function detectPeaks(values:Array, tolerance:int):void
{
var min:int = int.MAX_VALUE;
var max:int = int.MIN_VALUE;
var lookformax:int = 1;
var maxpos:int = 0;
var minpos:int = 0;
for(var i:int = 0; i < values.length; i++)
{
var v:int = values[i];
if (v > max)
{
max = v;
maxpos = i;
}
if (v < min)
{
min = v;
minpos = i;
}
if (lookformax == 1)
{
if (v < max - tolerance)
{
canvas.graphics.beginFill(0x00FF00);
canvas.graphics.drawCircle(maxpos % stage.stageWidth, (1 - (values[maxpos] / 100)) * stage.stageHeight, 5);
canvas.graphics.endFill();
min = v;
minpos = i;
lookformax = 0;
}
}
else
{
if (v > min + tolerance)
{
canvas.graphics.beginFill(0xFF0000);
canvas.graphics.drawCircle(minpos % stage.stageWidth, (1 - (values[minpos] / 100)) * stage.stageHeight, 5);
canvas.graphics.endFill();
max = v;
maxpos = i;
lookformax = 1;
}
}
}
}