Implementing the exponential function with basic arithmetic operations

For the purpose of the exercise, I have to implement the exponential function with the most basic arithmetic operations. I came up with this, where x is the base and y the exponent:
function expAetB(x, y) {
    var product = 1;
    // Multiply by the base x once per unit of the exponent y.
    for (var i = 0; i < y; i++)
    {
        product = product * x;
    }
    return product;
}
However, there are more basic operations than product=product*x;. I should somehow be able to replace it with another for loop that performs the multiplication and passes the result along, but I can't find a way to do it without falling into an infinite loop.

In the same way that exponentiation is repeated multiplication, so multiplication is simply repeated addition.
Simply create another function mulAetB which does that for you, and watch out for things like negative inputs.
You could go even one more level and define adding in terms of increment and decrement, but that may be overkill.
See, for example, the following program which uses the overkill method of addition:
#include <stdio.h>
static unsigned int add (unsigned int a, unsigned int b) {
unsigned int result = a;
while (b-- != 0) result++;
return result;
}
static unsigned int mul (unsigned int a, unsigned int b) {
unsigned int result = 0;
while (b-- != 0) result = add (result, a);
return result;
}
static unsigned int pwr (unsigned int a, unsigned int b) {
unsigned int result = 1;
while (b-- != 0) result = mul (result, a);
return result;
}
int main (void) {
int test[] = {0,5, 1,9, 2,4, 3,5, 7,2, -1}, *ip = test;
while (*ip != -1) {
printf ("%d + %d = %3u\n" , *ip, *(ip+1), add (*ip, *(ip+1)));
printf ("%d x %d = %3u\n" , *ip, *(ip+1), mul (*ip, *(ip+1)));
printf ("%d ^ %d = %3u\n\n", *ip, *(ip+1), pwr (*ip, *(ip+1)));
ip += 2;
}
return 0;
}
The output of this program shows that the calculations are correct:
0 + 5 = 5
0 x 5 = 0
0 ^ 5 = 0
1 + 9 = 10
1 x 9 = 9
1 ^ 9 = 1
2 + 4 = 6
2 x 4 = 8
2 ^ 4 = 16
3 + 5 = 8
3 x 5 = 15
3 ^ 5 = 243
7 + 2 = 9
7 x 2 = 14
7 ^ 2 = 49
If you really must have it in a single function, it's a simple matter of refactoring the function call to be inline:
static unsigned int pwr (unsigned int a, unsigned int b) {
unsigned int xres, xa, result = 1;
// Catch common cases, simplifies rest of function (a>1, b>0)
if (b == 0) return 1;
if (a == 0) return 0;
if (a == 1) return 1;
// Do power as repeated multiplication.
result = a;
while (--b != 0) {
// Do multiplication as repeated addition.
xres = result;
xa = a;
while (--xa != 0)
result = result + xres;
}
return result;
}

Related

Can't understand the calculation in the return statement of a recursive binary conversion program in C

This is a program for binary conversion using recursion.
It works fine, but I can't understand the meaning of one statement.
Can anyone help explain the following?
return (num % 2) + 10 * binary_conversion(num / 2);
With an input of 13 I am a little confused.
For num = 13 I get something like this:
13%2 = 1 + 10 * 6 = 66, which looks like a nonsensical calculation.
#include <stdio.h>
int binary_conversion(int);
int main()
{
int num, bin;
printf("Enter a decimal number: ");
scanf("%d", &num);
bin = binary_conversion(num);
printf("The binary equivalent of %d is %d\n", num, bin);
}
int binary_conversion(int num)
{
if (num == 0)
{
return 0;
}
else
{
return (num % 2) + 10 * binary_conversion(num / 2);
}
}
Your confusion stems from not understanding how recursion operates. It's time to instrument the function with print statements, which will let you follow the control and data flow of the routine.
int binary_conversion(int num)
{
printf("ENTER num = %d\n", num);
if (num == 0)
{
printf("BASE CASE returns 0\n");
return 0;
}
else
{
printf("RECURSION: new bit = %d, recur on %d\n", num % 2, num / 2);
return (num % 2) + 10 * binary_conversion(num / 2);
}
}

Optimizing code for reading some VLVs in a file?

I'm trying to read some variable-length-values from a file I created.
The file contains the following:
81 7F 81 01 2F F3 FF
There are two VLVs there, 81 7F and 81 01 which are 255 and 129 in decimal.
I also created some file-reader functions that go like this:
void read_byte_from_file_to(std::fstream& file, uint8_t& to) {
file.read((char*)&to, 1);
}
unsigned long readVLV(std::fstream& t_midi_file) {
unsigned long result = 0;
static unsigned long sum = 0, depth = 0, count = 0;
uint8_t c;
read_byte_from_file_to(t_midi_file, c);
++count;
if (c & 0x80) {
readVLV(t_midi_file);
}
sum += (c & 0x7F) << (7 * depth++);
if (count == depth) {
result = sum;
sum = 0;
depth = 0;
count = 0;
}
return result;
};
While running readVLV n times gives correct answers for the first n VLVs when reading from a file, I absolutely hate how I wrote it, with so many static variables and that ugly state reset. So if someone could point me in the right direction I'd be very pleased.
A basic _readVLV that takes the positional state as a parameter, instead of keeping it in static variables, could be written as
unsigned long _readVLV(
        std::fstream& t_midi_file,
        unsigned long& depth) {
    unsigned long sum = 0;
    uint8_t c;
    read_byte_from_file_to(t_midi_file, c);
    if (c & 0x80) {
        // Continuation bit set: decode the less significant bytes first.
        sum = _readVLV(t_midi_file, depth);
    }
    // depth now counts the bytes that were decoded after this one.
    return sum + ((unsigned long)(c & 0x7F) << (7 * depth++));
}
together with a public readVLV function that initializes that state and takes only the file, like so
unsigned long readVLV(std::fstream& t_midi_file) {
    unsigned long depth = 0;
    return _readVLV(t_midi_file, depth);
}

50 error: invalid operands to binary % (have' int ' and ' *int ')

I get this error at line 50. I don't know what to use instead of (*p).
I am learning how to use pointers and trying to use pointers in a function passing arguments by reference.
I've been staring at it for some time now.
# include "stdio.h"
int odd (int (*), int );
main(){
int i,n;
int size;
int main(){
int v[i];
int *p;
p = &v[0];
printf("Write the quantity of integers you want to ingress");
scanf("%d",&size);
for(i=0;i<size;i++){
printf("write a number");
scanf("%d",&n);
v[i]= n;
p = &v[i];
odd(&v[i],size);
printf("The value number %d is: %d \n",i,*p);
}
return 0;
}
int odd(int *p,int siz){
int i;
int counter = 0;
for(i=0;i<siz;i++){
/*50*/ if(*p % 2 = 0){ }
else counter++ ;
return counter;
}
}
You are confusing assignment (=) with testing for equality (==). Change:
if(*p % 2 = 0)
to:
if(*p % 2 == 0)
Also, the parentheses in your prototype for odd are unnecessary; it is clearer to change:
int odd (int (*), int );
to:
int odd (int *, int );

Rearranging an array in CUDA

I have the following problem that I want to implement on CUDA:
I want to read an array (say "flag[20]") and, based on a certain condition, write indices of this array to another array (say "pindex[]").
A simple implementation in C could be:
int N = 20;
int flag[N];
int pindex[N];
for(int i=0;i<N;i++)
flag[i] = -1;
for(int i=0;i<N;i+=2)
flag[i] = 0;
for(int i=0;i<N;i++)
pindex[i] = 0;
//operation: count # of times flag != -1 and write those indices in a different array
int pcount1 = 0;
for(int i=0;i<N;i++)
{
if(flag[i] != -1)
{
pindex[pcount1] = i;
++pcount1;
}
}
How will I implement this in CUDA?
I can use atomicAdd() to calculate the total number of times my condition is satisfied. But how do I write the indices to a different array? For example, I tried the following:
__global__ void kernel_tryatomic(int N,int* pcount,int* flag, int* pindex)
{
int tId=threadIdx.x;
int n=(blockIdx.x*2+blockIdx.y)*BlockSize+tId;
if(n > N-1) return;
if(flag[n] != -1)
{
atomicAdd(pcount,1);
atomicExch(&pindex[*pcount],n);
//pindex[*pcount] = n;
}
}
This code calculates "pcount" correctly, but does not update "pindex" array.
I need help to do this operation on GPUs.
Thanks
Since your condition (flag) is conceptually binary, you can use a binary prefix sum (thoroughly explained here) to determine the position to which each thread with a positive flag should write.
For example, if N is 20, with the help of the __device__ functions below:
__device__ int lanemask_lt(int lane) {
return (1 << (lane)) - 1;
}
__device__ int warp_prefix_sums(int lane, int p) {
const int mask = lanemask_lt( lane );
int b = __ballot( p ); // deprecated since CUDA 9; __ballot_sync is its replacement
return __popc( b & mask );
}
your __global__ function can simply be written like below:
__global__ void kernel_scan(int N,int* pcount,int* flag, int* pindex)
{
int tId=threadIdx.x;
if(tId >= N)
return;
int threadFlag = ( flag[tId] == -1 ) ? 0 : 1;
int position_to_write = warp_prefix_sums( tId & (warpSize-1), threadFlag );
if( threadFlag )
pindex[ position_to_write ] = tId;
}
If N is bigger than the warp size (32), you can use the intra-block binary prefix sum that is explained in the provided link.

Parallel prefix sum with multiple elements per thread without using thrust

I'm trying to perform an inclusive scan to find the cumulative sum of an array. Following the advice given by harrism here, I'm using the procedure given here, but following the advice of those authors, I'm trying to write code that has each thread calculate 4 elements instead of one to mask memory latency.
I am staying away from Thrust as performance is essential, and I need multi-stream capability. I have only just discovered CUB, and that will be my next effort, but I would like a multi-block solution and would also like to know where I've gone wrong in my existing code, just as an exercise to better understand CUDA.
The code below assigns 4 data elements to each thread, where each block must have a multiple of 32 threads. My data will have a multiple of 128 elements, so this restriction is acceptable to me. Enough shared memory is allocated to each block for the 4*blockDim.x elements plus an additional 32 elements to sum between warps. scanBlockAnyLength4 then adds the necessary offset to correct the mismatch between warps, saving the final value of each block to dev_blockSum in device global memory. sumWarp4_32 then scans this array to find the offsets that correct the mismatch between blocks, which are then added on in kernel_offsetBlocks.
#include<cuda.h>
#include<iostream>
using std::cout;
using std::endl;
#define MAX_THREADS 1024
#define MAX_BLOCKS 65536
#define N 512
__device__ float sumWarp4_128(float* ptr, const int tidx = threadIdx.x) {
const unsigned int lane = tidx & 31;
const unsigned int warpid = tidx >> 5; //32 threads per warp
unsigned int i = warpid*128+lane; //first element of block data set this thread looks at
if( lane >= 1 ) ptr[i] += ptr[i-1];
if( lane >= 2 ) ptr[i] += ptr[i-2];
if( lane >= 4 ) ptr[i] += ptr[i-4];
if( lane >= 8 ) ptr[i] += ptr[i-8];
if( lane >= 16 ) ptr[i] += ptr[i-16];
if( lane==0 ) ptr[i+32] += ptr[i+31];
if( lane >= 1 ) ptr[i+32] += ptr[i+32-1];
if( lane >= 2 ) ptr[i+32] += ptr[i+32-2];
if( lane >= 4 ) ptr[i+32] += ptr[i+32-4];
if( lane >= 8 ) ptr[i+32] += ptr[i+32-8];
if( lane >= 16 ) ptr[i+32] += ptr[i+32-16];
if( lane==0 ) ptr[i+64] += ptr[i+63];
if( lane >= 1 ) ptr[i+64] += ptr[i+64-1];
if( lane >= 2 ) ptr[i+64] += ptr[i+64-2];
if( lane >= 4 ) ptr[i+64] += ptr[i+64-4];
if( lane >= 8 ) ptr[i+64] += ptr[i+64-8];
if( lane >= 16 ) ptr[i+64] += ptr[i+64-16];
if( lane==0 ) ptr[i+96] += ptr[i+95];
if( lane >= 1 ) ptr[i+96] += ptr[i+96-1];
if( lane >= 2 ) ptr[i+96] += ptr[i+96-2];
if( lane >= 4 ) ptr[i+96] += ptr[i+96-4];
if( lane >= 8 ) ptr[i+96] += ptr[i+96-8];
if( lane >= 16 ) ptr[i+96] += ptr[i+96-16];
return ptr[i+96];
}
__host__ __device__ float sumWarp4_32(float* ptr, const int tidx = threadIdx.x) {
const unsigned int lane = tidx & 31;
const unsigned int warpid = tidx >> 5; //32 elements per warp
unsigned int i = warpid*32+lane; //first element of block data set this thread looks at
if( lane >= 1 ) ptr[i] += ptr[i-1];
if( lane >= 2 ) ptr[i] += ptr[i-2];
if( lane >= 4 ) ptr[i] += ptr[i-4];
if( lane >= 8 ) ptr[i] += ptr[i-8];
if( lane >= 16 ) ptr[i] += ptr[i-16];
return ptr[i];
}
__device__ float sumBlock4(float* ptr, const int tidx = threadIdx.x, const int bdimx = blockDim.x ) {
const unsigned int lane = tidx & 31;
const unsigned int warpid = tidx >> 5; //32 threads per warp
float val = sumWarp4_128(ptr);
__syncthreads();//should be included
if( tidx==bdimx-1 ) ptr[4*bdimx+warpid] = val;
__syncthreads();
if( warpid==0 ) sumWarp4_32((float*)&ptr[4*bdimx]);
__syncthreads();
if( warpid>0 ) {
ptr[warpid*128+lane] += ptr[4*bdimx+warpid-1];
ptr[warpid*128+lane+32] += ptr[4*bdimx+warpid-1];
ptr[warpid*128+lane+64] += ptr[4*bdimx+warpid-1];
ptr[warpid*128+lane+96] += ptr[4*bdimx+warpid-1];
}
__syncthreads();
return ptr[warpid*128+lane+96];
}
__device__ void scanBlockAnyLength4(float *ptr, float* dev_blockSum, const float* dev_input, float* dev_output, const int idx = threadIdx.x, const int bdimx = blockDim.x, const int bidx = blockIdx.x) {
const unsigned int lane = idx & 31;
const unsigned int warpid = idx >> 5;
ptr[lane+warpid*128] = dev_input[lane+warpid*128+bdimx*bidx*4];
ptr[lane+warpid*128+32] = dev_input[lane+warpid*128+bdimx*bidx*4+32];
ptr[lane+warpid*128+64] = dev_input[lane+warpid*128+bdimx*bidx*4+64];
ptr[lane+warpid*128+96] = dev_input[lane+warpid*128+bdimx*bidx*4+96];
__syncthreads();
float val = sumBlock4(ptr);
__syncthreads();
dev_blockSum[0] = 0.0f;
if( idx==0 ) dev_blockSum[bidx+1] = ptr[bdimx*4-1];
dev_output[lane+warpid*128+bdimx*bidx*4] = ptr[lane+warpid*128];
dev_output[lane+warpid*128+bdimx*bidx*4+32] = ptr[lane+warpid*128+32];
dev_output[lane+warpid*128+bdimx*bidx*4+64] = ptr[lane+warpid*128+64];
dev_output[lane+warpid*128+bdimx*bidx*4+96] = ptr[lane+warpid*128+96];
__syncthreads();
}
__global__ void kernel_sumBlock(float* dev_blockSum, const float* dev_input, float* dev_output ) {
extern __shared__ float ptr[];
scanBlockAnyLength4(ptr,dev_blockSum,dev_input,dev_output);
}
__global__ void kernel_offsetBlocks(float* dev_blockSum, float* dev_arr) {
const int tidx = threadIdx.x;
const int bidx = blockIdx.x;
const int bdimx = blockDim.x;
const int lane = tidx & 31;
const int warpid = tidx >> 5;
if( warpid==0 ) sumWarp4_32(dev_blockSum);
float val = dev_blockSum[warpid];
dev_arr[warpid*128+lane] += val;
dev_arr[warpid*128+lane+32] += val;
dev_arr[warpid*128+lane+64] += val;
dev_arr[warpid*128+lane+96] += val;
}
void scan4( const float input[], float output[]) {
int blocks = 2;
int threadsPerBlock = 64; //multiple of 32
int smemsize = (threadsPerBlock*4+32)*sizeof(float);
float* dev_input, *dev_output;
cudaMalloc((void**)&dev_input,blocks*threadsPerBlock*4*sizeof(float));
cudaMalloc((void**)&dev_output,blocks*threadsPerBlock*4*sizeof(float));
float *dev_blockSum;
cudaMalloc((void**)&dev_blockSum,blocks*sizeof(float));
int offset = 0;
int Nrem = N;
int chunksize;
while( Nrem ) {
chunksize = max(Nrem,blocks*threadsPerBlock*4);
cudaMemcpy(dev_input,(void**)&input[offset],chunksize*sizeof(float),cudaMemcpyHostToDevice);
kernel_sumBlock<<<blocks,threadsPerBlock,smemsize>>>(dev_blockSum,dev_input,dev_output);
kernel_offsetBlocks<<<blocks,threadsPerBlock>>>(dev_blockSum,dev_output);
cudaMemcpy((void**)&output[offset],dev_output,chunksize*sizeof(float),cudaMemcpyDeviceToHost);
offset += chunksize;
Nrem -= chunksize;
}
cudaFree(dev_input);
cudaFree(dev_output);
}
int main() {
float h_vec[N], sol[N];
for( int i = 0; i < N; i++ ) h_vec[i] = (float)i+1.0f;
scan4(h_vec,sol);
cout << "solution:" << endl;
for( int i = 0; i < N; i++ ) cout << i << " " << (i+2)*(i+1)/2 << " " << sol[i] << endl;
return 0;
}
To my eye, the code is producing wrong results because the lines in sumWarp4_128 are not executed in order within a warp. I.e., the if( lane==0 ) lines are executing before the other logical blocks that precede them. I thought this was not possible within a warp.
If I __syncthreads() before and after the lane==0 calls, I get some new exotic error that I just can't figure out.
Any help to point out where I've gone wrong would be appreciated.
The code you are writing has race conditions due to not synchronizing between threads that are sharing data. While it is true that this can be done on current hardware for communication within a warp (so-called warp-synchronous programming), it is highly discouraged because the race conditions in the code could cause it to fail on possible future hardware.
While it is true that you will get higher performance by processing multiple items per thread, 4 is not a magic number -- you should make this a tunable parameter if possible. CUDPP uses 8 per thread, for example.
I would highly recommend that you use CUB for this. You should use cub::BlockLoad to load multiple items per thread and cub::BlockScan to scan them. Then you would just need some code to combine multiple blocks. The most bandwidth-efficient way to do this is to use the "Reduce-Scan-Scan" approach that Thrust uses. First reduce each block (cub::BlockReduce) and store the sum from each block to a blockSums array. Then scan that array to get the per-block offset. Then perform a cub::BlockScan on the blocks and add the previously computed per-block offset to each element.