From online documentation:
cudaError_t cudaMemset(void *devPtr, int value, size_t count)
Fills the first count bytes of the memory area pointed to by devPtr with the constant byte value value.
Parameters:
devPtr - Pointer to device memory
value - Value to set for each byte of specified memory
count - Size in bytes to set
This description doesn't appear to be correct as:
int *dJunk;
cudaMalloc((void**)&dJunk, 32*sizeof(int));
cudaMemset(dJunk, 0x12, 32);
will set all 32 integers to 0x12, not 0x12121212. (Int vs Byte)
The description talks about setting bytes. Count and value are described in terms of bytes. Notice that count is of type size_t and value is of type int, i.e. set byte-sized locations from an int-typed value.
cudaMemset() is not mentioned in the programming guide.
I have to assume the behavior I am seeing is correct, and the documentation is bad.
Is there a better documentation source out there? (Where?)
Are other types supported? i.e. Would float *dJunk; work? Others?
The documentation is correct, and your interpretation of what cudaMemset does is wrong. The function really does set byte values. Your example sets the first 32 bytes to 0x12, not all 32 integers to 0x12, viz:
#include <cstdio>

int main(void)
{
    const int n = 32;
    const size_t sz = size_t(n) * sizeof(int);
    int *dJunk;
    cudaMalloc((void**)&dJunk, sz);
    cudaMemset(dJunk, 0, sz);
    cudaMemset(dJunk, 0x12, 32);
    int *Junk = new int[n];
    cudaMemcpy(Junk, dJunk, sz, cudaMemcpyDeviceToHost);
    for(int i=0; i<n; i++) {
        fprintf(stdout, "%d %x\n", i, Junk[i]);
    }
    cudaDeviceReset();
    return 0;
}
produces
$ nvcc memset.cu
$ ./a.out
0 12121212
1 12121212
2 12121212
3 12121212
4 12121212
5 12121212
6 12121212
7 12121212
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 0
31 0
i.e. all 128 bytes are set to 0, then the first 32 bytes are set to 0x12, exactly as the documentation describes.
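To answer the follow-up question: cudaMemset() writes bytes regardless of the pointed-to type, so a float* can be passed but is still filled byte-wise; there is no built-in memset for 32-bit patterns. To fill device memory with per-element values, launch a trivial kernel or use a library. A minimal sketch with Thrust (the variable names are mine):

#include <thrust/device_ptr.h>
#include <thrust/fill.h>

int *dJunk;
cudaMalloc((void**)&dJunk, 32 * sizeof(int));

// wrap the raw device pointer so Thrust dispatches the fill to the GPU,
// then assign a full 32-bit value per element (not per byte)
thrust::device_ptr<int> p = thrust::device_pointer_cast(dJunk);
thrust::fill(p, p + 32, 0x12121212);

The same pattern works for float or any other element type, since thrust::fill assigns whole elements rather than bytes.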
I'm trying to use 2D threads in CUDA. threadIdx.x and blockIdx.x work fine, but threadIdx.y and blockIdx.y don't work. The .y ones are always 0.
Here is my code:
#include <cstdio>

#define N 16

__global__ void add(int* a) {
    int i = threadIdx.x;
    int j = threadIdx.y;
    a[i] = j;
}

int main(int argc, char **argv)
{
    int a[N];
    const int size = N * sizeof(int);
    int *da;
    cudaMalloc((void**)&da, size);
    add<<<1, N>>>(da);
    cudaMemcpy(a, da, size, cudaMemcpyDeviceToHost);
    printf("Thread indices:\n");
    for(int i=0; i<N; i++)
    {
        printf("%d ", a[i]);
    }
    cudaFree(da);
    return 0;
}
The result for a[i] = j; or a[j] = j; is:
Thread indices:
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
and for a[i] = i;
Thread indices:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
I tried using
#define M 4
#define N 4
...
int i = (blockDim.x * blockIdx.x) + threadIdx.x;
int j = (blockDim.y * blockIdx.y) + threadIdx.y;
...
add<<<M, N>>>(da);
...
and the result is the same: the .x values are fine, but the .y values are all 0. Can anyone help me fix this? Thanks
You are confusing blocks and threads with dimensions.
add<<<M,N>>> is interpreted as add<<<dim3(M,1,1), dim3(N,1,1)>>>, where M is the number of blocks and N is the number of threads per block.
If you want an MxN grid of blocks, each containing MxN threads, call add<<<dim3(M,N), dim3(M,N)>>>.
I would recommend the Udacity CUDA course; it is very beginner-friendly.
I want M blocks with N threads per block.
Well, then add<<<M,N>>> is correct, but it is one-dimensional; there is no y to it. If you want to locate a thread within the whole launch, use this code:
int index = threadIdx.x + blockDim.x * blockIdx.x;
There is no y in it. The entire thing is 1D. Each block can only have a limited number of threads (1024 on current hardware, 512 on the oldest GPUs), which is why threads and blocks are separated. There are a lot of nuances to it. I would recommend the Udacity course; it helped me a lot.
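For completeness, if 2D indexing is actually wanted, the launch must use dim3. A minimal sketch under that assumption (the W/H names and the row-major flattening are mine):

#include <cstdio>

#define W 4 // block width
#define H 4 // block height

__global__ void add(int* a) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int j = threadIdx.y + blockDim.y * blockIdx.y;
    int w = blockDim.x * gridDim.x; // total threads in x
    a[j * w + i] = j;               // flatten 2D coordinates into 1D storage
}

int main()
{
    const int n = W * H;
    int a[n];
    int *da;
    cudaMalloc((void**)&da, n * sizeof(int));
    add<<<1, dim3(W, H)>>>(da);     // one block of W x H threads
    cudaMemcpy(a, da, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int k = 0; k < n; k++) printf("%d ", a[k]);
    printf("\n");
    cudaFree(da);
    return 0;
}

With this launch, threadIdx.y runs from 0 to H-1, so the stored values are no longer all zero.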
I have an array like this:
0 0 0 1 0 0 0 0 5 0 0 3 0 0 0 8 0 0
I want every non-zero elements to expand themselves one element at a time until it reaches other non-zero elements, the result is like this:
1 1 1 1 1 1 5 5 5 5 3 3 3 3 8 8 8 8
Is there any way to do this using thrust?
Is there any way to do this using thrust?
Yes, here is one possible approach.
For each position in the sequence, compute 2 distances. The first is the distance to the nearest non-zero value in the left direction, and the second is the distance to the nearest non-zero value in the right direction. If the position itself is non-zero, both left and right distances will be computed as zero. Our basic engine for this will be segmented inclusive scans, one computed in the left to right direction (to compute the distance from the left for each zero segment), and the other computed in the reverse direction (to compute the distance from the right for each zero segment). Using your example:
a vector:     0 0 0 1 0 0 0 0 5 0 0 3 0 0 0 8 0 0
a left dist:  ? ? ? 0 1 2 3 4 0 1 2 0 1 2 3 0 1 2
a right dist: 3 2 1 0 4 3 2 1 0 2 1 0 3 2 1 0 ? ?
Note that in each distance computation, we must special-case one end if that end does not happen to begin with a non-zero value (because the distance from that direction is "undefined"). We will special case those ? distances by assigning them large values, the reason for which will become evident in the next step.
We now will create a "map" vector, which, for each output position, allows us to select an element from the original input vector that belongs in that output position. This map vector is computed by taking the lesser of the two computed distances, and adjusting the index either from the left or the right, by that distance:
output index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
a left dist:   ?  ?  ?  0  1  2  3  4  0  1  2  0  1  2  3  0  1  2
a right dist:  3  2  1  0  4  3  2  1  0  2  1  0  3  2  1  0  ?  ?
map vector:    3  3  3  3  3  3  8  8  8  8 11 11 11 11 15 15 15 15
For the map vector computation, if a left dist > a right dist then we take the output index and add a right dist to it, to produce the map vector element at that position. Otherwise, we take the output index and subtract a left dist from it. Note that the special-case ? entries above should be considered to be "arbitrarily large" for this computation. This is simulated in the code by using a large integer (1<<30).
Once we have the map vector, it's a trivial matter to use it to do a mapped copy from input to output vectors:
a vector:    0  0  0  1  0  0  0  0  5  0  0  3  0  0  0  8  0  0
map vector:  3  3  3  3  3  3  8  8  8  8 11 11 11 11 15 15 15 15
out vector:  1  1  1  1  1  1  5  5  5  5  3  3  3  3  8  8  8  8
Here is a fully worked example:
$ cat t610.cu
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/scan.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/functional.h>
#include <thrust/transform.h>
#include <thrust/sequence.h>
#include <iostream>
#define IVAL (1<<30)
// used to create input vector for prefix sums (distance vector computation)
struct is_zero {
  template <typename T>
  __host__ __device__
  T operator() (T val) {
    return (val) ? 0 : 1;
  }
};
// inc and dec help with special casing of left and right ends
struct inc {
  template <typename T>
  __host__ __device__
  T operator() (T val) {
    return val + IVAL;
  }
};
struct dec {
  template <typename T>
  __host__ __device__
  T operator() (T val) {
    return val - IVAL;
  }
};
// this functor is lifted from thrust example code
// and is used to enable segmented scans based on flag delimiters
// BinaryPredicate for the head flag segment representation
// equivalent to thrust::not2(thrust::project2nd<int,int>()));
template <typename HeadFlagType>
struct head_flag_predicate : public thrust::binary_function<HeadFlagType,HeadFlagType,bool>
{
  __host__ __device__
  bool operator()(HeadFlagType left, HeadFlagType right) const
  {
    return !right;
  }
};
// distance tuple ordering is left (0), then right (1)
struct map_functor
{
  template <typename T>
  __host__ __device__
  int operator() (T dist) {
    int leftdist  = thrust::get<0>(dist);
    int rightdist = thrust::get<1>(dist);
    int idx       = thrust::get<2>(dist);
    return (leftdist > rightdist) ? (idx + rightdist) : (idx - leftdist);
  }
};
int main(){
  int h_a[] = { 0, 0, 0, 1, 0, 0, 0, 0, 5, 0, 0, 3, 0, 0, 0, 8, 0, 0 };
  int n = sizeof(h_a)/sizeof(h_a[0]);
  thrust::device_vector<int> a(h_a, h_a+n);
  thrust::device_vector<int> az(n);
  thrust::device_vector<int> asl(n);
  thrust::device_vector<int> asr(n);
  thrust::transform(a.begin(), a.end(), az.begin(), is_zero());
  // set up distance from the left vector (asl)
  thrust::transform_if(az.begin(), az.begin()+1, a.begin(), az.begin(), inc(), is_zero());
  thrust::transform(a.begin(), a.begin()+1, a.begin(), inc());
  thrust::inclusive_scan_by_key(a.begin(), a.end(), az.begin(), asl.begin(), head_flag_predicate<int>());
  thrust::transform(a.begin(), a.begin()+1, a.begin(), dec());
  thrust::transform_if(az.begin(), az.begin()+1, a.begin(), az.begin(), dec(), is_zero());
  // set up distance from the right vector (asr)
  thrust::device_vector<int> ra(n);
  thrust::sequence(ra.begin(), ra.end(), n-1, -1);
  thrust::transform_if(az.end()-1, az.end(), a.end()-1, az.end()-1, inc(), is_zero());
  thrust::transform(a.end()-1, a.end(), a.end()-1, inc());
  thrust::inclusive_scan_by_key(thrust::make_permutation_iterator(a.begin(), ra.begin()),
                                thrust::make_permutation_iterator(a.begin(), ra.end()),
                                thrust::make_permutation_iterator(az.begin(), ra.begin()),
                                thrust::make_permutation_iterator(asr.begin(), ra.begin()),
                                head_flag_predicate<int>());
  thrust::transform(a.end()-1, a.end(), a.end()-1, dec());
  // create combined map vector
  thrust::device_vector<int> map(n);
  thrust::counting_iterator<int> idxbegin(0);
  thrust::transform(thrust::make_zip_iterator(thrust::make_tuple(asl.begin(), asr.begin(), idxbegin)),
                    thrust::make_zip_iterator(thrust::make_tuple(asl.end(), asr.end(), idxbegin+n)),
                    map.begin(), map_functor());
  // use map to create output
  thrust::device_vector<int> result(n);
  thrust::copy(thrust::make_permutation_iterator(a.begin(), map.begin()),
               thrust::make_permutation_iterator(a.begin(), map.end()),
               result.begin());
  // display results
  std::cout << "Input vector:" << std::endl;
  thrust::copy(a.begin(), a.end(), std::ostream_iterator<int>(std::cout, " "));
  std::cout << std::endl;
  std::cout << "Output vector:" << std::endl;
  thrust::copy(result.begin(), result.end(), std::ostream_iterator<int>(std::cout, " "));
  std::cout << std::endl;
}
$ nvcc -arch=sm_20 -o t610 t610.cu
$ ./t610
Input vector:
0 0 0 1 0 0 0 0 5 0 0 3 0 0 0 8 0 0
Output vector:
1 1 1 1 1 1 5 5 5 5 3 3 3 3 8 8 8 8
$
Notes:
The above implementation probably has areas that can be improved on, particularly with respect to fusion of operations. However, for understanding purposes, I think fusion makes the code a bit harder to read.
I have really only tested it on the particular example you gave. There may be bugs that you will uncover. My purpose is not to give you a black-box library function that you use but don't understand, but rather to teach you how to write your own code that does what you want.
The "ambiguity" pointed out by JackOLantern is still present in your problem statement. I have obscured it by choosing my map functor behavior to mimic the output you indicated as desired, but simply by creating an equally valid but opposite realization of the map functor (using "if a left dist < a right dist then ..." instead) I can cause the result between 3 and 8 to take the other possible outcome/state. Your comment that "if there is an ambiguity, whoever reaches the position first fill its value to that space" makes no sense to me, unless by that you mean "I don't care which outcome you provide." There is no concept of a particular thread reaching a particular point first. Threads (and blocks) can execute in any order, and this order can change from device to device, and run to run.
Hello, I am a beginner in CUDA programming. I use the lock.lock() function to wait until the previous thread has finished its work. This is my code:
#include "book.h"
#include <cuda.h>
#include <conio.h>
#include <iostream>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>
#include <math.h>
#include <fstream>
#include <string>
#include <curand.h>
#include <curand_kernel.h>
#include "lock.h"
#define pop 10
#define gen 10
#define pg pop*gen
using namespace std;
__global__ void hold(Lock lock, float* a)
{
    __shared__ int cache[gen];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int cacheIndex = threadIdx.x;
    if (tid < gen)
    {
        a[tid] = 7; // an example number; in my case this is a random number
    }
    else
    {
        //cache[cacheIndex] = a[tid];
        int temp;
        if (tid % gen == 0)
        {
            a[tid] = tid + 4; // an example number; in my case this is a random number if tid==tid%gen
            temp = a[tid];
            tid += blockIdx.x * gridDim.x;
        }
        else
        {
            __syncthreads();
            a[tid] = temp + 1; // this must be a[tid] = a[tid-1] + 1;
            temp = a[tid];
            tid += blockIdx.x * gridDim.x;
        }
        cache[cacheIndex] = temp;
        __syncthreads();
        for (int i = 0; i < gen; i++)
        {
            if (cacheIndex == i)
            {
                lock.lock();
                cache[cacheIndex] = temp;
                lock.unlock();
            }
        }
    }
}
int main()
{
    float time;
    float* a = new float[pg];
    float *dev_a;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a, pg * sizeof(int) ) );
    Lock lock;
    cudaEvent_t start, stop;
    HANDLE_ERROR( cudaEventCreate(&start) );
    HANDLE_ERROR( cudaEventCreate(&stop) );
    HANDLE_ERROR( cudaEventRecord(start, 0) );
    hold<<<pop,gen>>>(lock, dev_a);
    HANDLE_ERROR( cudaMemcpy( a, dev_a, pg * sizeof(float), cudaMemcpyDeviceToHost ) );
    HANDLE_ERROR( cudaEventRecord(stop, 0) );
    HANDLE_ERROR( cudaEventSynchronize(stop) );
    HANDLE_ERROR( cudaEventElapsedTime(&time, start, stop) );
    for(int i=0; i<pop; i++)
    {
        for(int j=0; j<gen; j++)
        {
            cout << a[(i*gen)+j] << " ";
        }
        cout << endl;
    }
    printf("hold: %3.1f ms \n", time);
    HANDLE_ERROR(cudaFree(dev_a));
    HANDLE_ERROR( cudaEventDestroy( start ) );
    HANDLE_ERROR( cudaEventDestroy( stop ) );
    system("pause");
    return 0;
}
and this is the result:
7 7 7 7 7 7 7 7 7 7
14 0 0 0 0 0 0 0 0 0
24 0 0 0 0 0 0 0 0 0
34 0 0 0 0 0 0 0 0 0
44 0 0 0 0 0 0 0 0 0
54 0 0 0 0 0 0 0 0 0
64 0 0 0 0 0 0 0 0 0
74 0 0 0 0 0 0 0 0 0
84 0 0 0 0 0 0 0 0 0
94 0 0 0 0 0 0 0 0 0
my expected result:
7 7 7 7 7 7 7 7 7 7
14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33
34 35 36 37 38 39 40 41 42 43
44 45 46 47 48 49 50 51 52 53
54 55 56 57 58 59 60 61 62 63
64 65 66 67 68 69 70 71 72 73
74 75 76 77 78 79 80 81 82 83
84 85 86 87 88 89 90 91 92 93
94 95 96 97 98 99 100 101 102 103
Can anyone please help me correct my code? Thanks.
If you want help, it would be useful to point out that some of your code (e.g. lock.h and book.h) comes from the CUDA by Example book. This is not a standard part of CUDA, so if you don't indicate where it comes from, it may be confusing.
I see the following issues in your code:
You are using a __syncthreads() in a conditional block where not all threads will meet the __syncthreads() barrier:
if (tid % gen == 0)
{
    ...
}
else
{
    __syncthreads(); // illegal
}
The usage of __syncthreads() in this way is illegal because not all threads will be able to reach the __syncthreads() barrier:
__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.
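For contrast, here is a minimal sketch of a legal conditional __syncthreads(), where the condition evaluates identically across the block because it depends only on blockIdx (which is uniform per block), not on threadIdx:

if (blockIdx.x % 2 == 0)
{
    // ... some work ...
    __syncthreads(); // legal: every thread in the block takes the same branch
}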
You are using the temp local variable without initializing it first:
a[tid] = temp + 1; // this must be a[tid] = a[tid-1] + 1;
Note that temp is a thread-local variable. It is not shared amongst threads. Therefore the above line of code (for threads in the else block) is using an uninitialized value of temp.
The remainder of your kernel code:
cache[cacheIndex] = temp;
__syncthreads();
for (int i = 0; i < gen; i++)
{
    if (cacheIndex == i)
    {
        lock.lock();
        cache[cacheIndex] = temp;
        lock.unlock();
    }
}
does nothing useful because it is updating shared memory locations (i.e. cache) which are never transferred back to the dev_a variable, i.e. global memory. Therefore none of this code could affect the results you print out.
It's difficult to follow what you are trying to accomplish in your code. However, if you change this line (the uninitialized value):
int temp;
to this:
int temp=tid+3;
Your code will print out the data according to what you have shown.
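As an aside, for the expected output you show, the recurrence a[tid] = a[tid-1] + 1 with a[row*gen] = row*gen + 4 has the closed form a[tid] = tid + 4, so no inter-thread ordering (and no lock) is needed at all. A minimal sketch of a kernel producing exactly that table, keeping your hold<<<pop,gen>>> launch:

__global__ void hold(float* a)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < gen)
        a[tid] = 7;       // first row
    else
        a[tid] = tid + 4; // closed form of a[tid] = a[tid-1] + 1
}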
Let A be a properly aligned array of 32-bit integers in shared memory.
If a single warp tries to fetch elements of A at random, what is the expected number of bank conflicts?
In other words:
__shared__ int A[N]; //N is some big constant integer
...
int v = A[ random(0..N-1) ]; // <-- expected number of bank conflicts here?
Please assume Tesla or Fermi architecture. I don't want to delve into 32-bit vs 64-bit bank configurations of Kepler. Also, for simplicity, let us assume that all the random numbers are different (thus no broadcast mechanism).
My gut feeling suggests a number somewhere between 4 and 6, but I would like to find some mathematical evaluation of it.
I believe the problem can be abstracted away from CUDA and presented as a math problem. I searched for it as an extension of the Birthday Paradox, but I found really scary formulas there and didn't find a final formula. I hope there is a simpler way...
In math, this is thought of as a "balls in bins" problem: 32 balls are randomly dropped into 32 bins. You can enumerate the possible patterns and calculate their probabilities to determine the distribution. A naive approach will not work, though, as the number of patterns is huge: 63!/(32!·31!) is "almost" a quintillion.
It is possible to tackle, though, if you build up the solution recursively and use conditional probabilities.
Look for a paper called "The exact distribution of the maximum, minimum and the range of Multinomial/Dirichlet and Multivariate Hypergeometric frequencies" by Charles J. Corrado.
In the following, we start at leftmost bucket and calculate the probabilities for each number of balls that could have fallen into it. Then we move one to the right and determine the conditional probabilities of each number of balls that could be in that bucket given the number of balls and buckets already used.
Apologies for the VBA code, but VBA was all I had available when motivated to answer :).
Function nCr#(ByVal n#, ByVal r#)
    Static combin#()
    Static size#
    Dim i#, j#

    If n = r Then
        nCr = 1
        Exit Function
    End If

    If n > size Then
        ReDim combin(0 To n, 0 To n)
        combin(0, 0) = 1
        For i = 1 To n
            combin(i, 0) = 1
            For j = 1 To i
                combin(i, j) = combin(i - 1, j - 1) + combin(i - 1, j)
            Next
        Next
        size = n
    End If
    nCr = combin(n, r)
End Function

Function p_binom#(n#, r#, p#)
    p_binom = nCr(n, r) * p ^ r * (1 - p) ^ (n - r)
End Function

Function p_next_bucket_balls#(balls#, balls_used#, total_balls#, _
                              bucket#, total_buckets#, bucket_capacity#)
    If balls > bucket_capacity Then
        p_next_bucket_balls = 0
    Else
        p_next_bucket_balls = p_binom(total_balls - balls_used, balls, 1 / (total_buckets - bucket + 1))
    End If
End Function

Function p_capped_buckets#(n#, cap#)
    Dim p_prior, p_update
    Dim bucket#, balls#, prior_balls#

    ReDim p_prior(0 To n)
    ReDim p_update(0 To n)
    p_prior(0) = 1

    For bucket = 1 To n
        For balls = 0 To n
            p_update(balls) = 0
            For prior_balls = 0 To balls
                p_update(balls) = p_update(balls) + p_prior(prior_balls) * _
                    p_next_bucket_balls(balls - prior_balls, prior_balls, n, bucket, n, cap)
            Next
        Next
        p_prior = p_update
    Next
    p_capped_buckets = p_update(n)
End Function

Function expected_max_buckets#(n#)
    Dim cap#

    For cap = 0 To n
        expected_max_buckets = expected_max_buckets + (1 - p_capped_buckets(n, cap))
    Next
End Function

Sub test32()
    Dim p_cumm#(0 To 32)
    Dim cap#

    For cap = 0 To 32
        p_cumm(cap) = p_capped_buckets(32, cap)
    Next
    For cap = 1 To 32
        Debug.Print " ", cap, Format(p_cumm(cap) - p_cumm(cap - 1), "0.000000")
    Next
End Sub
For 32 balls and buckets, I get an expected maximum number of balls in the buckets of about 3.532941.
Output to compare to ahmad's:
1 0.000000
2 0.029273
3 0.516311
4 0.361736
5 0.079307
6 0.011800
7 0.001417
8 0.000143
9 0.000012
10 0.000001
11 0.000000
12 0.000000
13 0.000000
14 0.000000
15 0.000000
16 0.000000
17 0.000000
18 0.000000
19 0.000000
20 0.000000
21 0.000000
22 0.000000
23 0.000000
24 0.000000
25 0.000000
26 0.000000
27 0.000000
28 0.000000
29 0.000000
30 0.000000
31 0.000000
32 0.000000
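For readers without VBA handy, here is a sketch of the same recursion ported to C++ (variable names are mine; it reproduces the distribution above and the expected maximum of about 3.532941):

#include <cstdio>
#include <vector>

int main()
{
    const int n = 32;
    // Pascal's triangle for binomial coefficients C(i, j), in doubles
    std::vector<std::vector<double>> C(n + 1, std::vector<double>(n + 1, 0.0));
    for (int i = 0; i <= n; ++i) {
        C[i][0] = 1.0;
        for (int j = 1; j <= i; ++j)
            C[i][j] = C[i - 1][j - 1] + C[i - 1][j];
    }
    // binomial pmf: C(N,r) * p^r * (1-p)^(N-r)
    auto p_binom = [&](int N, int r, double p) {
        double v = C[N][r];
        for (int k = 0; k < r; ++k)     v *= p;
        for (int k = 0; k < N - r; ++k) v *= (1.0 - p);
        return v;
    };
    std::vector<double> p_cum(n + 1, 0.0); // p_cum[cap] = P(max occupancy <= cap)
    double expected = 0.0;
    for (int cap = 0; cap <= n; ++cap) {
        // p_prior[b] = P(buckets processed so far each hold <= cap balls,
        //                with b balls used in total)
        std::vector<double> p_prior(n + 1, 0.0), p_next(n + 1, 0.0);
        p_prior[0] = 1.0;
        for (int bucket = 1; bucket <= n; ++bucket) {
            for (int balls = 0; balls <= n; ++balls) {
                p_next[balls] = 0.0;
                for (int prior = 0; prior <= balls; ++prior) {
                    int added = balls - prior;
                    if (added > cap) continue;
                    // each remaining ball lands in this bucket with
                    // probability 1 / (number of buckets left)
                    p_next[balls] += p_prior[prior] *
                        p_binom(n - prior, added, 1.0 / (n - bucket + 1));
                }
            }
            p_prior = p_next;
        }
        p_cum[cap] = p_prior[n];
        expected += 1.0 - p_cum[cap]; // E[max] = sum over cap of P(max > cap)
    }
    for (int cap = 1; cap <= n; ++cap)
        printf("%2d -> %8.6f\n", cap, p_cum[cap] - p_cum[cap - 1]);
    printf("expected maximum = %f\n", expected);
    return 0;
}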
I'll try a math answer, although I don't have it quite right yet.
You basically want to know, given random 32-bit word indexing within a warp into an aligned __shared__ array, "what is the expected value of the maximum number of addresses within a warp that map to a single bank?"
If I consider the problem similar to hashing, then it relates to the expected maximum number of items that will hash to a single location, and this document shows an upper bound of O(log n / log log n) for hashing n items into n buckets. (The math is pretty hairy!)
For n = 32, that works out to about 2.788 (using natural log: ln 32 / ln ln 32 ≈ 3.466 / 1.243 ≈ 2.788). To compare, I modified ahmad's program to empirically calculate the expected maximum (I also simplified the code, renamed things for clarity, and fixed some bugs):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <algorithm>

#define NBANK 32
#define WARPSIZE 32
#define NSAMPLE 100000

int main(){
    int i=0, j=0;
    int *bank = (int*)malloc(sizeof(int)*NBANK);
    int *randomNumber = (int*)malloc(sizeof(int)*WARPSIZE);
    int *maxCount = (int*)malloc(sizeof(int)*(NBANK+1));
    memset(maxCount, 0, sizeof(int)*(NBANK+1));

    for (int i=0; i<NSAMPLE; ++i) {
        // generate a sample warp shared memory access
        for(j=0; j<WARPSIZE; j++){
            randomNumber[j] = rand() % NBANK;
        }
        // check the bank conflict
        memset(bank, 0, sizeof(int)*NBANK);
        int max_bank_conflict = 0;
        for(j=0; j<WARPSIZE; j++){
            bank[randomNumber[j]]++;
        }
        for(j=0; j<WARPSIZE; j++)
            max_bank_conflict = std::max<int>(max_bank_conflict, bank[j]);
        // store statistic
        maxCount[max_bank_conflict]++;
    }

    // report statistic
    printf("Max conflict degree %% (%d random samples)\n", NSAMPLE);
    float expected = 0;
    for(i=1; i<NBANK+1; i++) {
        float prob = maxCount[i]/(float)NSAMPLE;
        printf("%02d -> %6.4f\n", i, prob);
        expected += prob * i;
    }
    printf("Expected maximum bank conflict degree = %6.4f\n", expected);
    return 0;
}
Using the percentages found in the program as probabilities, the expected maximum value is the sum of products sum(i * probability(i)), for i from 1 to 32. I compute the expected value to be 3.529 (matches ahmad's data). It’s not super far off, but the 2.788 is supposed to be an upper bound. Since the upper bound is given in big-O notation, I guess there’s a constant factor left out. But that's currently as far as I've gotten.
Open questions: Is that constant factor enough to explain it? Is it possible to compute the constant factor for n = 32? It would be interesting to reconcile these, and/or to find a closed form solution for the expected maximum bank conflict degree with 32 banks and 32 parallel threads.
This is a very useful topic, since it can help in modeling and predicting performance when shared memory addressing is effectively random.
I assume Fermi 32-bank shared memory, where each 4 consecutive bytes are stored in consecutive banks. Using the following code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NBANK 32
#define N 7823
#define WARPSIZE 32
#define NSAMPLE 10000

int main(){
    srand( time(NULL) );
    int i=0, j=0;
    int *conflictCheck = NULL;
    int *randomNumber = NULL;
    int *statisticCheck = NULL;
    conflictCheck = (int*)malloc(sizeof(int)*NBANK);
    randomNumber = (int*)malloc(sizeof(int)*WARPSIZE);
    statisticCheck = (int*)malloc(sizeof(int)*(NBANK+1));
    memset(statisticCheck, 0, sizeof(int)*(NBANK+1));
    while(i<NSAMPLE){
        // generate a sample warp shared memory access
        for(j=0; j<WARPSIZE; j++){
            randomNumber[j] = rand() % NBANK;
        }
        // check the bank conflict
        memset(conflictCheck, 0, sizeof(int)*NBANK);
        int max_bank_conflict = 0;
        for(j=0; j<WARPSIZE; j++){
            conflictCheck[randomNumber[j]]++;
            max_bank_conflict = max_bank_conflict < conflictCheck[randomNumber[j]] ?
                conflictCheck[randomNumber[j]] : max_bank_conflict;
        }
        // store statistic
        statisticCheck[max_bank_conflict]++;
        // next iter
        i++;
    }
    // report statistic
    printf("Over %d random shared memory accesses, the following percentages of bank conflicts were found\n", NSAMPLE);
    for(i=0; i<NBANK+1; i++){
        printf("%d -> %6.4f\n", i, statisticCheck[i]/(float)NSAMPLE);
    }
    return 0;
}
I got the following output:
Over 10000 random shared memory accesses, the following percentages of bank conflicts were found
0 -> 0.0000
1 -> 0.0000
2 -> 0.0281
3 -> 0.5205
4 -> 0.3605
5 -> 0.0780
6 -> 0.0106
7 -> 0.0022
8 -> 0.0001
9 -> 0.0000
10 -> 0.0000
11 -> 0.0000
12 -> 0.0000
13 -> 0.0000
14 -> 0.0000
15 -> 0.0000
16 -> 0.0000
17 -> 0.0000
18 -> 0.0000
19 -> 0.0000
20 -> 0.0000
21 -> 0.0000
22 -> 0.0000
23 -> 0.0000
24 -> 0.0000
25 -> 0.0000
26 -> 0.0000
27 -> 0.0000
28 -> 0.0000
29 -> 0.0000
30 -> 0.0000
31 -> 0.0000
32 -> 0.0000
We can conclude that a 3- or 4-way conflict is the most likely outcome with random access. You can tune the run with different N (number of elements in the array), NBANK (number of banks in shared memory), WARPSIZE (warp size of the machine), and NSAMPLE (number of random shared memory accesses generated to evaluate the model).
How can I check whether a binary number is divisible by 10 (decimal), without converting it to another number system?
For example, we have a number:
1010 1011 0100 0001 0000 0100
How can we check whether this number is divisible by 10?
First split the number into odd and even bits (I'm calling "even" the bits corresponding to even powers of 2):

100100110010110000000101101110

even: 0 1 0 1 0 0 1 0 0 0 1 1 0 1 0
odd:  1 0 0 1 0 1 1 0 0 0 0 0 1 1 1

Now in each of these, add and subtract the digits alternately, as in the standard test for divisibility by 11 in decimal (starting with addition at the right):

even: +0 -1 +0 -1 +0 -0 +1 -0 +0 -0 +1 -1 +0 -1 +0 = -2
odd:  +1 -0 +0 -1 +0 -1 +1 -0 +0 -0 +0 -0 +1 -1 +1 = 1

Now double the sum of the odd digits and add it to the sum of the even digits:

2*1 + (-2) = 0

If the result is divisible by 5, as in this case, the number itself is divisible by 5. Since this number is also divisible by 2 (the rightmost digit being 0), it is divisible by 10.
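For completeness, here is a sketch of that even/odd test in C (function names are mine; it relies on 2^(2k) ≡ (-1)^k and 2^(2k+1) ≡ 2*(-1)^k mod 5, so n ≡ even_sum + 2*odd_sum mod 5):

/* alternating sums of even-position and odd-position bits,
   starting with addition at the right */
int divisible_by_5(unsigned int n)
{
    int even_sum = 0, odd_sum = 0;
    for (int pos = 0; n != 0; ++pos, n >>= 1) {
        int bit = (int)(n & 1);
        int sign = ((pos / 2) % 2 == 0) ? 1 : -1; /* alternate within each sequence */
        if (pos % 2 == 0) even_sum += sign * bit;
        else              odd_sum  += sign * bit;
    }
    /* in C, a negative multiple of 5 still has remainder 0 */
    return (2 * odd_sum + even_sum) % 5 == 0;
}

int divisible_by_10(unsigned int n)
{
    return (n & 1) == 0 && divisible_by_5(n);
}

Applied to the number in the question, divisible_by_10(0xAB4104u) returns 1: 1010 1011 0100 0001 0000 0100 is 11223300 = 10 * 1122330.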
If you are talking about computational methods, you can do a divisibility-by-5 test and a divisibility-by-2 test.
The numbers below assume unsigned 32-bit arithmetic, but can easily be extended to larger numbers.
I'll provide some code first, followed by a more textual explanation:
unsigned int div5exact(unsigned int n)
{
    // returns n/5 when n is actually divisible by 5
    // (because 'n * (INV5 * 5)' == 'n * 1' mod 2^32)
#define INV5 0xcccccccd
    return n * INV5;
}

unsigned int divides5(unsigned int n)
{
    unsigned int q = div5exact(n);
    if (q <= 0x33333333) /* q*5 < 2^32? */
    {
        /* q*5 doesn't overflow, so n == q*5 */
        return 1;
    }
    else
    {
        /* q*5 overflows, so n != q*5 */
        return 0;
    }
}

int divides2(unsigned int n)
{
    /* easy divisibility-by-2 test */
    return (n & 1) == 0;
}

int divides10(unsigned int n)
{
    return divides2(n) && divides5(n);
}

/* fast one-liner: */
#define DIVIDES10(n) ( ((n) & 1) == 0 && ((n) * 0xcccccccd) <= 0x33333333 )
Divisibility by 2 is easy: (n&1) == 0 means that n is even.
Divisibility by 5 involves multiplying by the inverse of 5, which is 0xcccccccd (because 0xcccccccd * 5 == 0x400000001, which is just 0x1 if you truncate to 32 bits).
When you multiply n*5 by the inverse of 5, you get n * 5 * (inverse of 5), which in 32-bit math simplifies to n*1.
Now let's say n and q are 32-bit numbers, and q = n * (inverse of 5) mod 2^32.
Because n is no greater than 0xffffffff, we know that n/5 is no greater than (2^32-1)/5 (which is 0x33333333). Therefore, if q is less than or equal to (2^32-1)/5, then q*5 doesn't get truncated in 32 bits and is therefore equal to n, so 5 divides n exactly.
If q is greater than (2^32-1)/5, then 5 doesn't divide n, because there is a one-to-one mapping between the 32-bit numbers divisible by 5 and the numbers between 0 and (2^32-1)/5, so any q outside this range doesn't correspond to a number that's divisible by 5.
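Since the argument is a bit subtle, a brute-force sanity check of the one-liner against the naive remainder test is cheap (a sketch, using the DIVIDES10 macro defined above):

#include <assert.h>

void check(void)
{
    /* exhaustively compare the multiplicative trick with n % 10 */
    for (unsigned int n = 0; n < 100000000u; ++n)
        assert(DIVIDES10(n) == (n % 10 == 0));
}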
Here is Python code to check divisibility by 10 by scanning the bits from the most significant end and tracking the running remainder mod 10:
# read the binary number as a string, e.g. "1010" or "1110"
s = input()

# x holds the remainder (mod 10) of the prefix read so far
x = 0
for i in s:
    if i == '1':
        x = (x*2 + 1) % 10
    else:
        x = x*2 % 10

# if x ends up 0, the number is divisible by 10
if x:
    print("Not divisible by 10")
else:
    print("Divisible by 10")