Operating System: It's the beginning of my lecture, but I couldn't write what was asked of me

Write a program in which a global variable count is initialized to 0.
Then, create two threads.
In both threads, let there be a loop from 1 to N, and increment count in the loop.
Try to see if count is equal to 2N for various values of N (e.g. N=10,000).
For each N, you should run your program multiple times.
Comment on your results.
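A minimal sketch of the exercise, assuming POSIX threads (the assignment doesn't name a threading API; the function name worker is mine):

#include <stdio.h>
#include <pthread.h>

#define N 10000

int count = 0; /* shared global counter */

/* Each thread increments the shared counter N times without locking. */
void *worker(void *arg)
{
    for (int i = 1; i <= N; i++)
        count++; /* unsynchronized read-modify-write: a data race */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("count = %d (expected %d)\n", count, 2 * N);
    return 0;
}

Run it repeatedly for each N: because count++ compiles to a load, an increment and a store, the two threads' updates can interleave and overwrite each other, so the printed count is frequently less than 2N and varies from run to run.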


How to copy a small part of an array from global memory into the local memory of a thread

I have an n x n matrix A of floats (stored row by row in an array). n is divisible by 4. I now want to output a matrix B which is (n/4) x (n/4). In B I want to store the arithmetic mean of the values of each 4x4 submatrix of A. The submatrices do not overlap. E.g. if A were 16x16, B would be 4x4, since A has 16 such submatrices.
The thing is that I want to use 1 thread per submatrix and I want the memory accesses to be as coalesced as possible. My idea was to create a kernel in which thread 0 reads the values A[0], A[1], A[2], A[3] and thread 1 reads A[4], A[5], A[6], A[7]. The problem is that if I execute these reads one after another (this code only works for the first row of the matrix), e.g.:
float firstRow[4];
for (int i = 0; i < 4; i++)
    firstRow[i] = A[threadIdx.x * 4 + i];
the reads would not be coalesced. I would need something like cudaMemcpy, but from global memory into a thread's local memory.
I am not sure if I could copy the whole array into shared memory, since it is 4 times bigger than the number of threads. Nevertheless, I want to know the answer to my original question.
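One answer-style sketch (my own, under assumed conventions: 8x8 thread blocks, one thread per 4x4 submatrix, n a multiple of 32; the names avg4x4 and TILE are illustrative): stage a 32x32 tile of A in shared memory with coalesced loads, then let each thread average its submatrix out of shared memory, where the access pattern no longer matters.

#define TILE 32 // 8x8 threads, each owning a 4x4 submatrix

__global__ void avg4x4(const float *A, float *B, int n)
{
    __shared__ float tile[TILE][TILE];

    int tileX = blockIdx.x * TILE; // top-left corner of this block's tile in A
    int tileY = blockIdx.y * TILE;
    int tid   = threadIdx.y * blockDim.x + threadIdx.x; // 0..63

    // Cooperative, coalesced load: consecutive threads read consecutive
    // addresses; each of the 64 threads loads 16 of the 1024 tile elements.
    for (int k = tid; k < TILE * TILE; k += blockDim.x * blockDim.y) {
        int r = k / TILE, c = k % TILE;
        tile[r][c] = A[(tileY + r) * n + (tileX + c)];
    }
    __syncthreads();

    // Each thread averages its own 4x4 submatrix from shared memory.
    float sum = 0.0f;
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            sum += tile[threadIdx.y * 4 + r][threadIdx.x * 4 + c];

    int outRow = blockIdx.y * 8 + threadIdx.y;
    int outCol = blockIdx.x * 8 + threadIdx.x;
    B[outRow * (n / 4) + outCol] = sum / 16.0f;
}

In the load loop, consecutive threads read consecutive global addresses, so the transactions are coalesced; launched e.g. with grid (n/32, n/32) and block (8, 8).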

Number of Threads Calculating a single Value

I am using CUDA with compute capability 1.2. I am running CUDA code in which each element of a matrix is calculated by the addition of 2 other matrices. I am calculating the value of one element with one thread. I want to know whether it is possible to use 2 threads to calculate a single value. If it is possible, can anyone please tell me how to use 2 different threads of the same block to calculate a single value?
If you need to calculate
q = m2[i][k] + m2[(k+1)][j] + p1[(i-1)]*p1[k]*p1[j];
by two cores, use a wider variable and fewer iterations, e.g. int2:
__shared__ int2 m2[N][N],p1[N],q;
could use two cores but not two threads. If you insist on two threads,
qThread1 = m2[i][k] + m2[(k+1)][j]; // in a kernel
...
qThread2 = p1[(i-1)]*p1[k]*p1[j]; // in another kernel
Then you simply add them into q with another thread. Synchronization, kernel-launch overhead, and cache utilization can decrease performance, as can the reduced instruction-level parallelism. Occupancy may increase, but it is not clear that this compensates for the negatives above.
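For completeness, a minimal sketch of what "two threads of the same block computing one value" could look like, using shared memory and a barrier (the names computeQ, m2flat and partial are mine; m2 is assumed flattened into a 1-D array of size N*N):

__global__ void computeQ(const int *m2flat, const int *p1, int *q,
                         int N, int i, int j, int k)
{
    __shared__ int partial[2];

    if (threadIdx.x == 0)
        partial[0] = m2flat[i * N + k] + m2flat[(k + 1) * N + j];
    else if (threadIdx.x == 1)
        partial[1] = p1[i - 1] * p1[k] * p1[j];

    __syncthreads(); // make both partial results visible to the block

    if (threadIdx.x == 0)
        *q = partial[0] + partial[1];
}

Launched with <<<1, 2>>> this is correct, but for one addition and two multiplications the __syncthreads() barrier costs far more than the split saves, which is the point of the answer above.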

CUDA Atomic operation on array in global memory

I have a CUDA program whose kernel basically does the following.
I provide a list of n points in Cartesian coordinates, e.g. (x_i, y_i), in a plane of dimension dim_x * dim_y. I invoke the kernel accordingly.
For every point (x_p, y_p) on this plane I calculate, by a formula, the time it would take each of those n points to reach it, given that the n points are moving with a certain velocity.
I order those times in increasing order t_0, t_1, ..., t_n, where the precision of t_i is set to one decimal place, i.e. if t'_i = 2.3453 then I only use t_i = 2.3.
Assuming the times are generated from a normal distribution, I simulate the 3 quickest times to find the percentage of times each of those 3 points arrived earliest. Suppose a random experiment gives prob_0 = 0.76, prob_1 = 0.20 and prob_2 = 0.04. Since the point with time t_0 arrives first most often amongst the three, I also return the original index (before the sorting of times) of that point, say idx_0 = 5 (an integer).
Hence for every point on this plane I get a pair (prob, idx).
Suppose n/2 of those points are of one kind and the rest are of the other.
Especially when the precision of the time was set to one decimal place, I noticed that the number of unique 3-tuples of times (t_0, t_1, t_2) was just 2.5% of the total number of data points, i.e. of points on the plane. This meant that most of the time the kernel was uselessly re-simulating when it could simply reuse the values from previous simulations. Hence I could use a dictionary with a 3-tuple of times as key and (idx, prob) as value. Since, as far as I know and have tested, the STL cannot be used inside a kernel, I constructed an array of floats of size 201000000. This choice came from experimentation, since none of the top 3 times exceeded 20 seconds: t_0 can take any value from {0.0, 0.1, 0.2, ..., 20.0}, giving 201 choices. I could construct a key for such a dictionary as follows:
Key = t_0 * 10^6 + t_1 * 10^3 + t_2
As far as the value is concerned, I could make it (prob + idx). Since idx is an integer and 0.0 <= prob <= 1.0, I can retrieve both values later by
prob = dict[key] - floor(dict[key]);
idx  = floor(dict[key]);
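A small sketch of that encoding, assuming the times are first rounded to integer tenths of a second (0..200) so the key is a plain array index; all helper names here are illustrative:

__device__ long long makeKey(int t0, int t1, int t2) // t_i in tenths, 0..200
{
    // Maximum key is 200*10^6 + 200*10^3 + 200 = 200200200 < 201000000.
    return (long long)t0 * 1000000 + (long long)t1 * 1000 + t2;
}

__device__ float packValue(int idx, float prob) { return (float)idx + prob; }
__device__ int   unpackIdx(float v)  { return (int)floorf(v); }
__device__ float unpackProb(float v) { return v - floorf(v); }

One caveat: a 32-bit float carries only 24 mantissa bits, so (float)idx + prob loses the fractional part once idx grows past a few million; the packing works only while idx stays small.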
So now my kernel looks like the following
__global__ void my_kernel(float *points, float *dict, float *p, float *i,
                          size_t width, ...) {
    unsigned int col = blockIdx.y*blockDim.y + threadIdx.y;
    unsigned int row = blockIdx.x*blockDim.x + threadIdx.x;
    float prob;
    int idx;
    // Calculate the time taken for each of the n points to reach this point on the plane
    // Order the times in increasing order t_0, t_1, ..., t_n
    // Calculate key = t_0 * 10^6 + t_1 * 10^3 + t_2
    if (dict[key] > 0.0f) {
        prob = dict[key] - floorf(dict[key]);
        idx  = (int)floorf(dict[key]);
    }
    else {
        // Simulate to find prob and idx
        dict[key] = prob + idx;
    }
    p[row*width + col] = prob;
    i[row*width + col] = idx;
}
The result is quite similar to that of the original program for most points, but for some it is wrong.
I am quite sure that this is due to a race condition. Notice that dict was initialized with all zeroes. The basic idea is to make the data structure "read many, write once" for any particular location of dict.
I am aware that there may be far more optimized ways of solving this problem than allocating so much memory; please let me know if so. But I would really like to understand why this particular solution is failing. In particular, I would like to know how to use atomicAdd in this setting; I have not managed to use it.
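One way to get the "write once" behaviour (a sketch, not the original code): since an empty slot is 0.0f, a thread can try to claim the slot with atomicCAS on the float's bit pattern and fall back to whatever another thread already published. The helper name is mine:

__device__ float dictLookupOrPublish(float *dict, long long key, float value)
{
    // Atomically swap 0.0f -> value; atomicCAS works on the 32-bit pattern.
    int old = atomicCAS((int *)&dict[key],
                        __float_as_int(0.0f),
                        __float_as_int(value));
    if (old == __float_as_int(0.0f))
        return value;            // this thread claimed the empty slot
    return __int_as_float(old);  // another thread published first; reuse it
}

Several threads may still simulate the same key redundantly before the first one publishes, but only one value ever lands in the slot, so readers never see a mixed value. atomicAdd is the wrong primitive here: adding into a slot another thread already filled would corrupt it, whereas compare-and-swap expresses "write only if still empty".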
Unless your simulation in the else branch is very long (~100s of floating-point operations), a lookup table in global memory is likely to be slower than running the computation. Global memory access is very expensive!
In any case, there is no way to save time by "skipping work" using conditional branching. The Single Instruction, Multiple Thread architecture of a GPU means that the instructions for both sides of the branch will be executed serially, unless all of the threads in a block follow the same branch.
Edit:
The fact that you are seeing a performance increase as a result of introducing the conditional branch and you didn't have any problems with deadlock suggests that all the threads in each block are always taking the same branch. I suspect that once dict starts getting populated, the performance increase will go away.
Perhaps I have misunderstood something, but if you want to calculate the probability of an event x, assuming a normal distribution and given the mean mu and standard deviation sigma, there is no need to generate a load of random numbers and approximate a Gaussian curve. You can directly calculate the probability:
p = exp(-((x - mu) * (x - mu) / (2.0f * sigma * sigma))) /
    (sigma * sqrt(2.0f * M_PI));

Dynamic vs Static instruction count

What is the difference between dynamic and static instruction count?
a. Derive an expression to calculate the user CPU time as a function of the following parameters: the dynamic instruction count (N), the clock cycles per instruction (CPI), and the clock frequency (f).
b. Explain the reason for choosing the 'dynamic' instruction count as a parameter in Question 3a instead of the 'static' instruction count.
The dynamic instruction count is the actual number of instructions executed by the CPU for a specific program execution, whereas the static instruction count is the number of instructions the program contains in its binary.
We usually use the dynamic instruction count because, for example, if you have a loop in your program then some instructions are executed more than once; also, in the presence of branches, some instructions may not be executed at all.
Execution time (ET) = clock cycles per instruction (CPI) * number of instructions executed (IC) * cycle duration (CD).
Since the clock frequency/rate (CR) is simply the inverse of the cycle duration (cycles per second, and vice versa):
ET = (CPI * IC) / CR
In the question's notation, with IC = N and CR = f, the user CPU time is (N * CPI) / f.
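As a quick sanity check with invented numbers: a program executing N = 10^9 instructions at CPI = 2 on an f = 2 GHz processor takes ET = (10^9 * 2) / (2 * 10^9) = 1 second.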

CUDA finding the max value in given array

I tried to develop a small CUDA program to find the max value in a given array:
int input_data[0..49] = 1, 2, 3, 4, 5, ..., 50
max_value is initialized to the first value, input_data[0].
The final answer is stored in result[0].
The kernel gives 0 as the max value. I don't know what the problem is.
I launched it with 1 block of 50 threads.
__device__ int lock = 0;

__global__ void max(float *input_data, float *result)
{
    float max_value = input_data[0];
    int tid = threadIdx.x;

    if (input_data[tid] > max_value)
    {
        do {} while (atomicCAS(&lock, 0, 1));
        max_value = input_data[tid];
        __threadfence();
        lock = 0;
    }
    __syncthreads();

    result[0] = max_value; // Final result of max value
}
I know there are built-in functions; I am just practicing on small problems.
You are trying to set up a "critical section", but on CUDA this approach can hang your whole program - try to avoid it whenever possible.
Why does your code hang?
Your kernel (__global__ function) is executed by groups of 32 threads, called warps. All threads inside a single warp execute in lockstep. So the warp will spin in your do {} while (atomicCAS(&lock, 0, 1)) until all threads of the warp succeed in obtaining the lock. But the lock is designed to prevent several threads from executing the critical section at the same time, so the threads of one warp can never all acquire it together. This leads to a hang.
Alternative solution
What you need is a "parallel reduction algorithm". You can start reading here:
Parallel prefix sum # wikipedia
Parallel Reduction # CUDA website
NVIDIA's Guide to Reduction
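For the single-block case in the question, a minimal shared-memory max reduction could look like this (a sketch in the spirit of the samples above; it assumes blockDim.x is a power of two and n <= blockDim.x, and the name maxReduce is mine):

#include <float.h> // FLT_MAX

__global__ void maxReduce(const float *input, float *result, int n)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;

    // Load into shared memory; pad out-of-range threads with a sentinel.
    sdata[tid] = (tid < n) ? input[tid] : -FLT_MAX;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }

    if (tid == 0)
        result[0] = sdata[0]; // the block's maximum
}

For the 50-element example it could be launched as maxReduce<<<1, 64, 64 * sizeof(float)>>>(d_input, d_result, 50);.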
Your code has a potential race. I'm not sure whether you defined the max_value variable in shared memory or not, but both cases are wrong.
1) If max_value is just a local variable, then each thread holds a private copy of it, which is not the actual maximum value (it is just the maximum of input_data[0] and input_data[tid]). In the last line of the code, all threads write their own max_value to result[0], which results in undefined behavior.
2) If max_value is a shared variable, 49 threads will enter the if block and try to update max_value one at a time using the lock. But the order of execution among the 49 threads is not defined, so some threads may overwrite the actual maximum with smaller values. You would need to compare against max_value again inside the critical section.
Max is a 'reduction' - check out the Reduction sample in the SDK, and do max instead of summation.
The white paper's a little old but still reasonably useful:
http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf
The final optimization step is to use 'warp synchronous' coding to avoid unnecessary __syncthreads() calls.
It requires at least 2 kernel invocations - one to write a bunch of intermediate max() values to global memory, then another to take the max() of that array.
If you want to do it in a single kernel invocation, check out the threadfenceReduction SDK sample. That uses __threadfence() and atomicAdd() to track progress, then has 1 block do a final reduction when all blocks have finished writing their intermediate results.
There are different access scopes for variables. When you define a variable with __device__, it is placed in GPU global memory and is accessible by all threads in the grid. __shared__ places the variable in the block's shared memory, accessible only by the threads of that block. Finally, if you use no keyword at all, as in float max_value, the variable is placed in thread registers and can be accessed only by that thread. In your code, each thread has its own local max_value and cannot see the copies held by other threads.
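An illustrative snippet of the three scopes (the names are mine):

__device__ int g_value; // global memory: visible to every thread in the grid

__global__ void scope_demo(void)
{
    __shared__ float s_buf[64]; // shared memory: visible only within this block
    float local = 1.0f;         // register/local: private to this thread

    s_buf[threadIdx.x] = local;
    __syncthreads();
    if (threadIdx.x == 0)
        g_value = (int)s_buf[0];
}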