Why didn't I get the expected value from my CUDA kernel? [duplicate]

This question already has answers here:
What is the difference between the float and integer data type when the size is the same?
(3 answers)
Is there any way to reduce sum 100M float elements of an array in CUDA?
(1 answer)
Is floating point math broken?
(31 answers)
Closed 4 months ago.
Hi, this is my CUDA kernel:
__global__ void calculate() {
    float a = 1008521344.0;
    float b = 3995.0;
    float c = 19228.0;
    float d = 0;
    d = a + b*c;
    printf("test result: %.6f\n", d);
}
Running this kernel prints:
test result: 1085337216.000000
But I expect the value to be 1085337204.
Where does this mismatch come from?
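The linked duplicates point at single-precision rounding. Here is a minimal sketch of that effect in Python rather than CUDA, assuming the kernel's `float` is IEEE 754 binary32. The helper `to_f32` is a hypothetical name; it round-trips a value through `struct` to mimic storing it in a 32-bit float.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

a = to_f32(1008521344.0)
b = to_f32(3995.0)
c = to_f32(19228.0)

exact = 1008521344 + 3995 * 19228   # integer arithmetic, no rounding
product = to_f32(b * c)             # the product rounded to binary32
result = to_f32(a + product)        # the sum rounded to binary32

print(exact)         # 1085337204
print(int(product))  # 76815856
print(int(result))   # 1085337216
```

Between 2^30 and 2^31 adjacent binary32 values are 128 apart, so 1085337204 cannot be stored and the nearest representable value, 1085337216, is printed instead. Declaring the variables as `double` would keep all digits for numbers of this size.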

Related

Fortran function for geometric series [duplicate]

This question already has answers here:
What is the purpose of result variables in Fortran?
(1 answer)
Does Fortran preserve the value of internal variables through function and subroutine calls?
(3 answers)
Fortran assignment on declaration and SAVE attribute gotcha
(2 answers)
Closed 3 years ago.
I am implementing this equation with a Fortran function.
The write statement inside the function prints g, which is consistently 6, but the value returned to the program (z) varies between runs, e.g.,
-2123950080
-529463296
929961984
Why are g and z not the same? What should I change in the function to compute the geometric series?
function geomSeries(n)
  implicit none
  integer :: i, n, g = 0, x = 1, geomSeries
  g = 1 + x
  do i = 2, n
    g = g + x**i
  end do
  write (*,*) g
  return
end function geomSeries
program geomFunc
  implicit none
  integer :: n = 5, geomSeries, z
  z = geomSeries(n)
  write (*,*) z
end program geomFunc
P.S. Ideally I would like it to be a pure function, since I don't see how it produces side effects, but I haven't managed to compile it that way (why?).

Why is 134.605*100 not equal to 13460.5 in VBA Access? [duplicate]

This question already has answers here:
Compare double in VBA precision problem
(10 answers)
Closed 8 years ago.
I guess it has something to do with the precision but I don't understand it. I have this piece of code that should perform rounding to 2 places (value is the input):
Dim a As Double
Dim b As Double
Dim c As Double
Dim result As Double
a = CDbl(value) * 100
b = a + 0.5 ' Int in the next line floors, so 0.5 is added to get proper rounding: Int(123.5) = 123, but we need Int(123.5 + 0.5) = 124
c = Int(b)
result = c / 100
The thing is, it doesn't work properly for the value 134.605. While debugging I found out that a is calculated incorrectly.
In fact I have checked this:
?13460.5 = (134.605*100)
False
?(134.605*100) - 13460.5
-1,02318153949454E-12
And I'm stuck. I could either rewrite the rounding function differently, but I have no idea how to do it without the *100 multiplication, or I could find out how to make the *100 operation work correctly for all values. Could anyone explain why this happens and suggest a solution?

Try the Decimal data type. Although you cannot explicitly declare a variable of this type, you can declare it as Variant and convert the value using CDec.
Debug.Print 13460.5 = CDec(134.605 * 100) 'Output - True
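The same effect and the same kind of fix can be sketched in Python, assuming its `decimal` module as a stand-in for VBA's Decimal type (both hold 134.605 exactly, unlike a binary Double). The variable names here are illustrative only.

```python
import math
from decimal import Decimal, ROUND_HALF_UP

value = 134.605

# Binary doubles: the scaled value lands just below 13460.5,
# so floor(x + 0.5) rounds down instead of up.
scaled = value * 100
naive = math.floor(scaled + 0.5) / 100

# Decimal arithmetic: build the number from a string so it is held exactly.
exact = Decimal("134.605").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(scaled == 13460.5)  # False
print(naive)              # 134.6
print(exact)              # 134.61
```

The key point is that the decimal value has to be created before any binary multiplication takes place; converting an already rounded double only helps when the conversion itself hides the error.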

Parallel Anti diagonal 'for' loop?

I have an N x N square matrix of integers (which is stored in the device as a 1-d array for convenience).
I'm implementing an algorithm which requires the following to be performed:
There are 2N - 1 anti-diagonals in this square. (Anti-diagonals are the parallel lines running from the top edge to the left edge and from the right edge to the bottom edge.)
I need a for loop with 2N - 1 iterations, each iteration computing one anti-diagonal, starting from the top left and ending at the bottom right.
In each iteration, all the elements in that anti-diagonal must be computed in parallel.
Each anti-diagonal is calculated based on the values of the previous anti-diagonal.
So, how do I index the threads with this requirement in CUDA?

As far as I understand, you want something like
Parallelizing the Smith-Waterman Local Alignment Algorithm using CUDA A
At each iteration, the kernel is launched with a different number of threads.
Perhaps the code in Parallel Anti diagonal 'for' loop could be modified as
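#define BLOCKSIZE 32
// Integer division rounded up.
int iDivUp(const int a, const int b) { return (a % b != 0) ? (a / b + 1) : (a / b); }
__global__ void antiparallel(float* d_A, int step, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // row index
    int j = step - i;                               // column on the anti-diagonal i + j == step
    if (i >= N || j < 0 || j >= N) return;          // thread falls outside the matrix
    /* do work on d_A[i*N+j] */
}
for (int step = 0; step < 2*N-1; step++) {
    dim3 dimBlock(BLOCKSIZE);
    dim3 dimGrid(iDivUp(step + 1, dimBlock.x));     // rows 0..step are candidates, so step + 1 threads
    antiparallel<<<dimGrid, dimBlock>>>(d_A, step, N);
}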
This code is untested and is just a sketch of a possible solution (provided that I have not misunderstood your question). Furthermore, I do not know how efficient such a solution would be, since many kernels will be launched with very few threads.
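To check the index arithmetic on the host before writing the kernel, the mapping from a step to the cells of its anti-diagonal can be sketched in Python. The function name `anti_diagonal` is a made-up helper, and the matrix is assumed to be stored row-major in a 1-D array as in the question.

```python
def anti_diagonal(step: int, n: int) -> list[tuple[int, int]]:
    """Cells (i, j) of an n x n matrix with i + j == step."""
    first_row = max(0, step - n + 1)
    last_row = min(step, n - 1)
    return [(i, step - i) for i in range(first_row, last_row + 1)]

n = 3
for step in range(2 * n - 1):
    cells = anti_diagonal(step, n)
    flat = [i * n + j for i, j in cells]  # offsets into the 1-D array
    print(step, cells, flat)
```

Each inner list corresponds to the threads that do real work in one kernel launch; every cell of the matrix appears in exactly one step.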

Integer equals 0 instead of 1 [duplicate]

This question already has answers here:
Is floating point math broken?
(31 answers)
Actionscript 3 Math inconsistencies
(3 answers)
Closed 8 years ago.
I don't understand why the values in the following code are not equal to 1:
var a:uint = (4.1-1.7)/2.4;
trace(a);//traces 0
var b:int = (4.1-1.7)/2.4;
trace(b);//traces 0
var c:Number = (4.1-1.7)/2.4;
trace(c);//traces 0.9999999999999998

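This is because, in most languages, non-integer real numbers are stored in a floating-point representation (http://en.wikipedia.org/wiki/Floating_point), which is inherently susceptible to minor inaccuracies. Here the quotient comes out slightly below 1, and converting it to int or uint drops the fractional part, which leaves 0.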
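The same arithmetic can be reproduced in Python, assuming its `float` matches ActionScript's `Number` (both are IEEE 754 doubles). The sketch contrasts truncation with rounding to nearest.

```python
c = (4.1 - 1.7) / 2.4

truncated = int(c)   # conversion to integer drops the fractional part
rounded = round(c)   # rounding to nearest recovers the intended value

print(c < 1.0)       # True
print(truncated)     # 0
print(rounded)       # 1
```

In ActionScript the analogous fix would be to round explicitly before assigning to an int or uint variable.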

How to implement three stacks using a single array

I came across this problem on an interview website. The problem asks for an efficient implementation of three stacks in a single array, such that no stack overflows until there is no space left in the entire array.
For implementing 2 stacks in an array, it's pretty obvious: the 1st stack grows from LEFT to RIGHT, the 2nd stack grows from RIGHT to LEFT, and when the two top indices cross, it signals an overflow.
Thanks in advance for your insightful answer.

You can implement three stacks with a linked list:
You need a pointer to the next free element. Three more pointers give the top element of each stack (or null, if the stack is empty).
When a stack gets another element added, it takes the first free element and the free pointer moves on to the next free element (or an overflow error is raised if there is none). The stack's own pointer then points to the new element, which in turn points back to the previous top of that stack.
When a stack gets an element removed, it hands the element back to the list of free elements. The stack's own pointer is redirected to the next element in the stack.
A linked list can be implemented within an array.
How (space) efficient is this?
It is no problem to build a linked list by using two cells of an array for each list element (value + pointer). Depending on the specification you could even fit pointer and value into one array element (e.g. the array holds long values, while value and pointer are only int).
Compare this to the solution of kgiannakakis, where you lose up to 50% (only in the worst case). But I think that my solution is a bit cleaner (and maybe more academic, which should be no disadvantage for an interview question ^^).
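The scheme above can be sketched as follows. This is only an illustration in Python, assuming two parallel lists (`values` and `next_index`) in place of the "two cells per element" layout; the class and attribute names are made up for the example.

```python
class ThreeStacks:
    """Three stacks sharing fixed-size arrays, linked through a free list."""

    def __init__(self, capacity: int) -> None:
        self.values = [None] * capacity
        # next_index[i] links cell i to the cell below it in its stack,
        # or to the next free cell while i is unused; -1 ends a chain.
        self.next_index = list(range(1, capacity)) + [-1] if capacity else []
        self.free = 0 if capacity else -1   # head of the free list
        self.tops = [-1, -1, -1]            # top cell of each stack

    def push(self, stack: int, value) -> None:
        if self.free == -1:
            raise OverflowError("array is full")
        cell = self.free
        self.free = self.next_index[cell]         # unlink from the free list
        self.values[cell] = value
        self.next_index[cell] = self.tops[stack]  # link to the old top
        self.tops[stack] = cell

    def pop(self, stack: int):
        cell = self.tops[stack]
        if cell == -1:
            raise IndexError("stack is empty")
        self.tops[stack] = self.next_index[cell]
        self.next_index[cell] = self.free         # hand the cell back
        self.free = cell
        value, self.values[cell] = self.values[cell], None
        return value


stacks = ThreeStacks(4)
stacks.push(0, "a")
stacks.push(1, "b")
stacks.push(0, "c")
stacks.push(2, "d")
print(stacks.pop(0))  # c
stacks.push(1, "e")   # reuses the cell that "c" occupied
print(stacks.pop(1))  # e
```

Push and pop are both constant time, and a push only fails when every cell is in use, which is the property the question asks for. The price is one link per element.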

See Knuth, The Art of Computer Programming, Volume 1, Section 2.2.2. titled "Sequential allocation". Discusses allocating multiple queues/stacks in a single array, with algorithms dealing with overflows, etc.

We can use a long bit array that records which stack the i-th array cell belongs to.
Each cell needs one of four values, so two bits per cell suffice (00 - empty, 01 - A, 10 - B, 11 - C). That takes 2N bits, or N/4 bytes, of additional memory for an array of N cells.
For example, for 1024 long int elements (4096 bytes) it would take only 256 bytes, or about 6%.
This bit map can be placed in the same array, at the beginning or at the end, shrinking the usable size of the given array by a constant 6%.
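A small sketch of such a two-bit ownership map in Python, assuming four tags are packed into each byte of a `bytearray`. The function names are hypothetical, and the sketch covers only the map itself, not the push and pop logic built on top of it.

```python
def owner_map_bytes(cells: int) -> int:
    """Bytes needed for a map that stores 2 bits per array cell."""
    return (2 * cells + 7) // 8

def set_owner(tags: bytearray, cell: int, owner: int) -> None:
    """Store a 2-bit tag (0 = empty, 1 = A, 2 = B, 3 = C) for one cell."""
    shift = (cell % 4) * 2
    cleared = tags[cell // 4] & (~(0b11 << shift) & 0xFF)
    tags[cell // 4] = cleared | (owner << shift)

def get_owner(tags: bytearray, cell: int) -> int:
    """Read the 2-bit tag of one cell."""
    return (tags[cell // 4] >> ((cell % 4) * 2)) & 0b11

cells = 1024
element_bytes = 4
overhead = owner_map_bytes(cells) / (cells * element_bytes)
print(owner_map_bytes(cells), overhead)  # 256 0.0625
```

Finding the top of a given stack still requires scanning the map or keeping three extra indices, which this sketch leaves out.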

First stack grows from left to right.
Second stack grows from right to left.
The third stack starts from the middle. Suppose an odd-sized array for simplicity. Then the third stack grows like this:
* * * * * * * * * * *
      5 3 1 2 4
The first and second stacks are allowed to grow to at most half the size of the array. The third stack can grow to fill the whole array.
The worst case is when one of the first two stacks grows to 50% of the array. Then 50% of the array is wasted. To optimise the efficiency, the third stack must be chosen to be the one that grows quicker than the other two.
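The alternating growth of the middle stack can be written as a small index formula. This is a sketch in Python with a made-up function name, assuming 1-based element numbers as in the diagram.

```python
def third_stack_index(middle: int, k: int) -> int:
    """Array index of the k-th element (1-based) pushed on the middle stack."""
    offset = k // 2
    return middle + offset if k % 2 == 0 else middle - offset

middle = 5  # centre of an 11-cell array
print([third_stack_index(middle, k) for k in range(1, 6)])  # [5, 6, 4, 7, 3]
```

Even-numbered elements go to the right of the centre and odd-numbered ones to the left, so the stack only needs its element count to locate its top.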

This is an interesting conundrum, and I don't have a real answer but thinking slightly outside the box...
It could depend on what each element in the stack consists of. If it's three stacks of true/false flags, then you could use the first three bits of integer elements, i.e. bit 0 is the value for the first stack, bit 1 is the value for the second stack, and bit 2 is the value for the third stack. Then each stack can grow independently until the whole array is full for that stack. This is even better, as the other stacks can continue to grow even when the first stack is full.
I know this is cheating a bit and doesn't really answer the question but it does work for a very specific case and no entries in the stack are wasted. I am watching with interest to see whether anyone can come up with a proper answer that works for more generic elements.

Split the array into any 3 parts (it doesn't matter whether you split it sequentially or interleaved). If one stack grows beyond 1/3 of the array, you start filling the parts of the other two stacks from their ends.
aaa bbb ccc
1   2   3
145 2   3
145 2 6 3
145 2 6 3 7
145 286 3 7
145 286 397
The worst case is when two stacks grow up to the 1/3 boundary, and then you have 30% of the space wasted.

Assuming that all array positions should be used to store values - I guess it depends on your definition of efficiency.
If you do the two stack solution, place the third stack somewhere in the middle, and track both its bottom and top, then most operations will continue to be efficient, at a penalty of an expensive Move operation (of the third stack towards wherever free space remains, moving to the half way point of free space) whenever a collision occurs.
It's certainly going to be quick to code and understand. What are our efficiency targets?

A rather silly but effective solution could be:
Store the first stack elements at i*3 positions: 0,3,6,...
Store the second stack elements at i*3+1 positions: 1,4,7...
And third stack elements at i*3+2 positions.
The problem with this solution is that the memory used will always be three times the size of the deepest stack, and that you can overflow even when there are free positions in the array.
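A minimal sketch of the interleaved layout in Python, with hypothetical names, shows both how simple the indexing is and where the drawback comes from.

```python
class InterleavedStacks:
    """Three stacks whose elements sit at positions 3*i + stack."""

    def __init__(self, capacity: int) -> None:
        self.cells = [None] * capacity
        self.depths = [0, 0, 0]  # number of elements in each stack

    def push(self, stack: int, value) -> None:
        index = 3 * self.depths[stack] + stack
        if index >= len(self.cells):
            raise OverflowError(f"stack {stack} is full")
        self.cells[index] = value
        self.depths[stack] += 1

    def pop(self, stack: int):
        if self.depths[stack] == 0:
            raise IndexError(f"stack {stack} is empty")
        self.depths[stack] -= 1
        index = 3 * self.depths[stack] + stack
        value, self.cells[index] = self.cells[index], None
        return value


stacks = InterleavedStacks(6)
stacks.push(0, "a")
stacks.push(0, "b")
print(stacks.cells)  # ['a', None, None, 'b', None, None]
```

A third push onto stack 0 would raise `OverflowError` here even though four of the six cells are still empty.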

Make a HashMap keyed by the begin and end positions, e.g. <"B1", 0>, <"E1", n/3>.
For each Push(value), add a condition to check whether the position of Bx is before Ex or whether there is some other "By" in between. Let's call it condition (2).
With the above condition in mind:
if (2) is true // B1 and E1 are in order
{
    if (S1.Push()) then E1++;
    else // overflow
    { start pushing at the end of E2 or E3 (whichever has space) and update E1 to be E2-- or E3--; }
}
if (2) is false
{
    if (S1.Push()) then E1--;
    else // overflow
    { start pushing at the end of E2 or E3 (whichever has space) and update E1 to be E2-- or E3--; }
}

Assume you only have integer indices, that the data is handled as FILO (First In, Last Out) without referencing individual elements, and that a single array holds everything. Using six of its cells as stack references should help:
[head-1, last-1, head-2, last-2, head-3, last-3, data, data, ... ,data]
You can get by with four cells, because head-1 = 0 and last-3 = array length. If you use FIFO (First In, First Out) you need to re-index.
nb: I’m working on improving my English.

first stack grows at 3n,
second stack grows at 3n+1,
third grows at 3n+2
for n={0...N}

Yet another approach (in addition to the linked list) is to use a map of stacks. In that case you need an additional log(3^n)/log(2) bits to build a map of how the data is distributed in your n-length array. Each base-3 digit of the map says which stack owns the next element.
Ex. a.push(1); b.push(2); c.push(3); a.push(4); a.push(5); will give you the image
aacba
54321
The appropriate value of the map is calculated while elements are pushed onto a stack (shifting the contents of the array):
map0 = any
map1 = map0*3 + 0
map2 = map1*3 + 1
map3 = map2*3 + 2
map4 = map3*3 + 0
map5 = map4*3 + 0 = any*3^5 + 45
and length of stacks 3,1,1
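The map arithmetic above can be checked with a short Python sketch. The function names are made up, and the arbitrary starting value "any" is assumed to be 0 so that the result is a concrete number.

```python
def encode(owners: str) -> int:
    """Pack a push sequence over stacks 'a', 'b', 'c' into one base-3 number."""
    cell_map = 0  # stands in for the arbitrary starting value "any"
    for owner in owners:
        cell_map = cell_map * 3 + "abc".index(owner)
    return cell_map

def decode(cell_map: int, count: int) -> str:
    """Recover the owners of the last `count` pushes, newest first."""
    owners = []
    for _ in range(count):
        cell_map, digit = divmod(cell_map, 3)
        owners.append("abc"[digit])
    return "".join(owners)

print(encode("abcaa"))  # 45
print(decode(45, 5))    # aacba
```

The decoded string lists the newest element first, which matches the image shown for the example pushes.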
When you want to do c.pop(), you have to reorganize your elements: find the actual position of c.top() in the array by walking the cell map (i.e. divide by 3 while the value mod 3 isn't 2), then shift the contents of the array back to cover the hole. While walking the cell map you have to store every position you pass (mapX); after passing the one that points to stack "c", divide by 3 once more, multiply the result by 3^(number of positions passed - 1), and add mapX to get the new value of the cell map.
The overhead is fixed and depends on the size of a stack element (bits_per_value):
(log(3^n)/log(2)) / (n*bits_per_value) = (log(3)/log(2)) / bits_per_value
So for bits_per_value = 32 the space overhead is about 5%, and it shrinks as bits_per_value grows (for 64 bits it is about 2.5%).

In this approach, any stack can grow as long as there is any free space in the array.
We sequentially allocate space to the stacks and we link new blocks to the previous block. This means any new element in a stack keeps a pointer to the previous top element of that particular stack.
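import java.util.ArrayDeque;
import java.util.Deque;

int stackSize = 300;
int indexUsed = 0;                              // next never-used slot in buffer
int[] stackPointer = {-1, -1, -1};              // top node of each stack, -1 if empty
StackNode[] buffer = new StackNode[stackSize * 3];
Deque<Integer> freeSlots = new ArrayDeque<>();  // slots released by pop, reused first

void push(int stackNum, int value) {
    // Reuse a released slot if there is one; otherwise take the next fresh slot.
    int index = freeSlots.isEmpty() ? indexUsed++ : freeSlots.pop();
    buffer[index] = new StackNode(stackPointer[stackNum], value);
    stackPointer[stackNum] = index;
}

int pop(int stackNum) {
    int index = stackPointer[stackNum];
    int value = buffer[index].value;
    stackPointer[stackNum] = buffer[index].previous;
    buffer[index] = null;
    // The popped slot is not necessarily the last one allocated, so record it
    // instead of decrementing indexUsed, which could overwrite a live node.
    freeSlots.push(index);
    return value;
}

int peek(int stack) { return buffer[stackPointer[stack]].value; }

boolean isEmpty(int stackNum) { return stackPointer[stackNum] == -1; }

class StackNode {
    public int previous;  // index of the node below this one, -1 at the bottom
    public int value;
    public StackNode(int p, int v) {
        value = v;
        previous = p;
    }
}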

This code implements 3 stacks in single array. It takes care of empty spaces and fills the empty spaces in between the data.
#include <stdio.h>

#define MAX_NODES 50

struct stacknode {
    int value;
    int prev;                        /* index of the node below, -1 at the bottom */
};

struct stacknode stacklist[MAX_NODES];
int top[3] = {-1, -1, -1};           /* top node of each stack, -1 if empty */
int freelist[MAX_NODES];             /* indices released by pop */
int stackindex = 0;                  /* next never-used index */
int freeindex = -1;                  /* top of freelist, -1 if empty */

void push(int stackno, int value) {
    int index;
    if (freeindex >= 0) {
        /* Reuse a hole left by an earlier pop. */
        index = freelist[freeindex];
        freeindex--;
    } else if (stackindex < MAX_NODES) {
        index = stackindex;
        stackindex++;
    } else {
        printf("No space left to push %d in stack %d\n", value, stackno);
        return;
    }
    stacklist[index].value = value;
    stacklist[index].prev = top[stackno-1];   /* -1 when the stack was empty */
    top[stackno-1] = index;
    printf("%d is pushed in stack %d at %d\n", value, stackno, index);
}

int pop(int stackno) {
    int index, value;
    if (top[stackno-1] == -1) {
        printf("No elements in the stack %d\n", stackno);
        return -1;
    }
    index = top[stackno-1];
    freeindex++;
    freelist[freeindex] = index;
    value = stacklist[index].value;
    top[stackno-1] = stacklist[index].prev;
    printf("%d is popped out from stack %d at %d\n", value, stackno, index);
    return value;
}

int main() {
    push(1,1);
    push(1,2);
    push(3,3);
    push(2,4);
    pop(3);
    pop(3);
    push(3,3);
    push(2,3);
    return 0;
}

Another solution in Python; please let me know if it works the way you expect.
class Stack(object):
    def __init__(self):
        self.stack = list()
        self.first_length = 0
        self.second_length = 0
        self.third_length = 0
        self.first_pointer = 0    # index of the top of stack 2
        self.second_pointer = 1   # one past the index of the top of stack 3

    def push(self, stack_num, item):
        if stack_num == 1:
            self.first_pointer += 1
            self.second_pointer += 1
            self.first_length += 1
            self.stack.insert(0, item)
        elif stack_num == 2:
            self.second_length += 1
            self.second_pointer += 1
            self.stack.insert(self.first_pointer, item)
        elif stack_num == 3:
            self.third_length += 1
            self.stack.insert(self.second_pointer - 1, item)
        else:
            raise Exception('Push failed, stack number %d is not allowed' % stack_num)

    def pop(self, stack_num):
        if stack_num == 1:
            if self.first_length == 0:
                raise Exception('No more element in first stack')
            self.first_pointer -= 1
            self.first_length -= 1
            self.second_pointer -= 1
            return self.stack.pop(0)
        elif stack_num == 2:
            if self.second_length == 0:
                raise Exception('No more element in second stack')
            self.second_length -= 1
            self.second_pointer -= 1
            return self.stack.pop(self.first_pointer)
        elif stack_num == 3:
            if self.third_length == 0:
                raise Exception('No more element in third stack')
            self.third_length -= 1
            return self.stack.pop(self.second_pointer - 1)
        else:
            raise Exception('Pop failed, stack number %d is not allowed' % stack_num)

    def peek(self, stack_num):
        if stack_num == 1:
            return self.stack[0]
        elif stack_num == 2:
            return self.stack[self.first_pointer]
        elif stack_num == 3:
            # Same position that pop uses for stack 3.
            return self.stack[self.second_pointer - 1]
        else:
            raise Exception('Peek failed, stack number %d is not allowed' % stack_num)

    def size(self):
        return len(self.stack)

s = Stack()
# push item into stack 1
s.push(1, '1st_stack_1')
s.push(1, '2nd_stack_1')
s.push(1, '3rd_stack_1')
#
## push item into stack 2
s.push(2, 'first_stack_2')
s.push(2, 'second_stack_2')
s.push(2, 'third_stack_2')
#
## push item into stack 3
s.push(3, 'FIRST_stack_3')
s.push(3, 'SECOND_stack_3')
s.push(3, 'THIRD_stack_3')
#
print('Before pop out: ')
for i, elm in enumerate(s.stack):
    print('\t\t%d)' % i, elm)
#
s.pop(1)
s.pop(1)
#s.pop(1)
s.pop(2)
s.pop(2)
#s.pop(2)
#s.pop(3)
s.pop(3)
s.pop(3)
#s.pop(3)
#
print('After pop out: ')
#
for i, elm in enumerate(s.stack):
    print('\t\t%d)' % i, elm)

Maybe this can help you a bit... I wrote it myself :)
// by ashakiran bhatter
// compile: g++ -std=c++11 test.cpp
// run : ./a.out
// sample output as below
// adding: 1 2 3 4 5 6 7 8 9
// array contents: 9 8 7 6 5 4 3 2 1
// popping now...
// array contents: 8 7 6 5 4 3 2 1
#include <iostream>
#include <cstdint>

#define MAX_LEN 9
#define LOWER 0
#define UPPER 1
#define FULL -1
#define NOT_SET -1

class CStack
{
private:
    int8_t array[MAX_LEN];
    int8_t stack1_range[2];
    int8_t stack2_range[2];
    int8_t stack3_range[2];
    int8_t stack1_size;
    int8_t stack2_size;
    int8_t stack3_size;
    int8_t stack1_cursize;
    int8_t stack2_cursize;
    int8_t stack3_cursize;
    int8_t stack1_curpos;
    int8_t stack2_curpos;
    int8_t stack3_curpos;
public:
    CStack();
    ~CStack();
    void push(int8_t data);
    void pop();
    void print();
};

CStack::CStack()
{
    stack1_range[LOWER] = 0;
    stack1_range[UPPER] = MAX_LEN/3 - 1;
    stack2_range[LOWER] = MAX_LEN/3;
    stack2_range[UPPER] = (2 * (MAX_LEN/3)) - 1;
    stack3_range[LOWER] = 2 * (MAX_LEN/3);
    stack3_range[UPPER] = MAX_LEN - 1;
    stack1_size = stack1_range[UPPER] - stack1_range[LOWER];
    stack2_size = stack2_range[UPPER] - stack2_range[LOWER];
    stack3_size = stack3_range[UPPER] - stack3_range[LOWER];
    stack1_cursize = stack1_size;
    stack2_cursize = stack2_size;
    stack3_cursize = stack3_size;
    stack1_curpos = stack1_cursize;
    stack2_curpos = stack2_cursize;
    stack3_curpos = stack3_cursize;
}

CStack::~CStack()
{
}

void CStack::push(int8_t data)
{
    if(stack3_cursize != FULL)
    {
        array[stack3_range[LOWER] + stack3_curpos--] = data;
        stack3_cursize--;
    } else if(stack2_cursize != FULL) {
        array[stack2_range[LOWER] + stack2_curpos--] = data;
        stack2_cursize--;
    } else if(stack1_cursize != FULL) {
        array[stack1_range[LOWER] + stack1_curpos--] = data;
        stack1_cursize--;
    } else {
        std::cout<<"\tstack is full...!"<<std::endl;
    }
}

void CStack::pop()
{
    std::cout<<"popping now..."<<std::endl;
    if(stack1_cursize < stack1_size)
    {
        array[stack1_range[LOWER] + ++stack1_curpos] = 0;
        stack1_cursize++;
    } else if(stack2_cursize < stack2_size) {
        array[stack2_range[LOWER] + ++stack2_curpos] = 0;
        stack2_cursize++;
    } else if(stack3_cursize < stack3_size) {
        array[stack3_range[LOWER] + ++stack3_curpos] = 0;
        stack3_cursize++;
    } else {
        std::cout<<"\tstack is empty...!"<<std::endl;
    }
}

void CStack::print()
{
    std::cout<<"array contents: ";
    for(int8_t i = stack1_range[LOWER] + stack1_curpos + 1; i <= stack1_range[UPPER]; i++)
        std::cout<<" "<<static_cast<int>(array[i]);
    for(int8_t i = stack2_range[LOWER] + stack2_curpos + 1; i <= stack2_range[UPPER]; i++)
        std::cout<<" "<<static_cast<int>(array[i]);
    for(int8_t i = stack3_range[LOWER] + stack3_curpos + 1; i <= stack3_range[UPPER]; i++)
        std::cout<<" "<<static_cast<int>(array[i]);
    std::cout<<"\n";
}

int main()
{
    CStack stack;
    std::cout<<"adding: ";
    for(uint8_t i = 1; i < 10; i++)
    {
        std::cout<<" "<<static_cast<int>(i);
        stack.push(i);
    }
    std::cout<<"\n";
    stack.print();
    stack.pop();
    stack.print();
    return 0;
}

You can implement any number N of stacks or queues in a single array. By "using a single array" I mean that one array stores all the data of all the stacks and queues; N other arrays can still be used to keep track of the indices of the elements that belong to each stack or queue.
Solution:
Store the data sequentially in the array as it is inserted into any of the stacks or queues, and store its index in the index-keeping array of that particular stack or queue.
For example: you have 3 stacks (s1, s2, s3) and you want to implement them using a single array (dataArray[]). So we make 3 other arrays (a1[], a2[], a3[]) for s1, s2 and s3 respectively, which keep track of all of their elements in dataArray[] by saving their indices.
insert(s1, 10) at dataArray[0] a1[0] = 0;
insert(s2, 20) at dataArray[1] a2[0] = 1;
insert(s3, 30) at dataArray[2] a3[0] = 2;
insert(s1, 40) at dataArray[3] a1[1] = 3;
insert(s3, 50) at dataArray[4] a3[1] = 4;
insert(s3, 60) at dataArray[5] a3[2] = 5;
insert(s2, 30) at dataArray[6] a2[1] = 6;
and so on ...
Now we perform operations on dataArray[] by using a1, a2 and a3 for the respective stacks and queues.
To pop an element from s1:
return a1[0]
shift all elements to the left
Take a similar approach for the other operations too, and you can implement any number of stacks and queues in a single array.
