Expected number of bank conflicts in shared memory at random access - cuda

Let A be a properly aligned array of 32-bit integers in shared memory.
If a single warp tries to fetch elements of A at random, what is the expected number of bank conflicts?
In other words:
__shared__ int A[N]; //N is some big constant integer
...
int v = A[ random(0..N-1) ]; // <-- expected number of bank conflicts here?
Please assume Tesla or Fermi architecture. I don't want to delve into the 32-bit vs 64-bit bank configurations of Kepler. Also, for simplicity, let us assume that all the random numbers are different (thus no broadcast mechanism).
My gut feeling suggests a number somewhere between 4 and 6, but I would like to find some mathematical evaluation of it.
I believe the problem can be abstracted out from CUDA and presented as a math problem. I searched for it as an extension of the Birthday Paradox, but I found really scary formulas there and no final formula. I hope there is a simpler way...

In math, this is thought of as a "balls in bins" problem - 32 balls are randomly dropped into 32 bins. You can enumerate the possible patterns and calculate their probabilities to determine the distribution. A naive approach will not work though, as the number of patterns is huge: 63!/(32!·31!) is "almost" a quintillion.
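By stars and bars, the number of distinct patterns when dropping 32 indistinguishable balls into 32 bins is
$$\binom{32+32-1}{32-1} = \binom{63}{31} = \frac{63!}{32!\,31!} \approx 9.2 \times 10^{17},$$
and those patterns are not even equiprobable under the multinomial distribution, so direct enumeration is hopeless.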
It is possible to tackle though if you build up the solution recursively and use conditional probabilities.
Look for a paper called "The exact distribution of the maximum, minimum and the range of Multinomial/Dirichlet and Multivariate Hypergeometric frequencies" by Charles J. Corrado.
In the following, we start at the leftmost bucket and calculate the probabilities for each number of balls that could have fallen into it. Then we move one bucket to the right and determine the conditional probabilities of each number of balls that could be in that bucket, given the number of balls and buckets already used.
Apologies for the VBA code, but VBA was all I had available when motivated to answer :).
Function nCr#(ByVal n#, ByVal r#)
    Static combin#()
    Static size#
    Dim i#, j#
    If n = r Then
        nCr = 1
        Exit Function
    End If
    If n > size Then
        ReDim combin(0 To n, 0 To n)
        combin(0, 0) = 1
        For i = 1 To n
            combin(i, 0) = 1
            For j = 1 To i
                combin(i, j) = combin(i - 1, j - 1) + combin(i - 1, j)
            Next
        Next
        size = n
    End If
    nCr = combin(n, r)
End Function

Function p_binom#(n#, r#, p#)
    p_binom = nCr(n, r) * p ^ r * (1 - p) ^ (n - r)
End Function

Function p_next_bucket_balls#(balls#, balls_used#, total_balls#, _
                              bucket#, total_buckets#, bucket_capacity#)
    If balls > bucket_capacity Then
        p_next_bucket_balls = 0
    Else
        p_next_bucket_balls = p_binom(total_balls - balls_used, balls, 1 / (total_buckets - bucket + 1))
    End If
End Function

Function p_capped_buckets#(n#, cap#)
    Dim p_prior, p_update
    Dim bucket#, balls#, prior_balls#
    ReDim p_prior(0 To n)
    ReDim p_update(0 To n)
    p_prior(0) = 1
    For bucket = 1 To n
        For balls = 0 To n
            p_update(balls) = 0
            For prior_balls = 0 To balls
                p_update(balls) = p_update(balls) + p_prior(prior_balls) * _
                    p_next_bucket_balls(balls - prior_balls, prior_balls, n, bucket, n, cap)
            Next
        Next
        p_prior = p_update
    Next
    p_capped_buckets = p_update(n)
End Function

Function expected_max_buckets#(n#)
    Dim cap#
    For cap = 0 To n
        expected_max_buckets = expected_max_buckets + (1 - p_capped_buckets(n, cap))
    Next
End Function
Sub test32()
    Dim p_cumm#(0 To 32)
    Dim cap#
    For cap = 0 To 32
        p_cumm(cap) = p_capped_buckets(32, cap)
    Next
    For cap = 1 To 32
        Debug.Print " ", cap, Format(p_cumm(cap) - p_cumm(cap - 1), "0.000000")
    Next
End Sub
For 32 balls and buckets, I get an expected maximum number of balls in the buckets of about 3.532941.
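For reference, expected_max_buckets relies on the tail-sum identity for a non-negative integer random variable $M$ (here the maximum bucket count):
$$E[M] = \sum_{c \ge 0} P(M > c) = \sum_{c \ge 0} \bigl(1 - P(M \le c)\bigr),$$
where $P(M \le c)$ is exactly p_capped_buckets(n, c), the probability that all n balls fit into buckets capped at c.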
Output to compare to ahmad's:
1 0.000000
2 0.029273
3 0.516311
4 0.361736
5 0.079307
6 0.011800
7 0.001417
8 0.000143
9 0.000012
10 0.000001
11 0.000000
12 0.000000
13 0.000000
14 0.000000
15 0.000000
16 0.000000
17 0.000000
18 0.000000
19 0.000000
20 0.000000
21 0.000000
22 0.000000
23 0.000000
24 0.000000
25 0.000000
26 0.000000
27 0.000000
28 0.000000
29 0.000000
30 0.000000
31 0.000000
32 0.000000

I'll try a math answer, although I don't have it quite right yet.
You basically want to know, given random 32-bit word indexing within a warp into an aligned __shared__ array, "what is the expected value of the maximum number of addresses within a warp that map to a single bank?"
If I consider the problem similar to hashing, then it relates to the expected maximum number of items that will hash to a single location, and this document shows an upper bound of O(log n / log log n) on that number for hashing n items into n buckets. (The math is pretty hairy!)
For n = 32, that works out to ln 32 / ln ln 32 = 3.466 / 1.243 ≈ 2.788 (using natural logs). That's fine, but here I modified ahmad's program a bit to empirically calculate the expected maximum (I also simplified the code, renamed things for clarity, and fixed some bugs).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <algorithm>

#define NBANK 32
#define WARPSIZE 32
#define NSAMPLE 100000

int main(){
    int i=0, j=0;
    int *bank=(int*)malloc(sizeof(int)*NBANK);
    int *randomNumber=(int*)malloc(sizeof(int)*WARPSIZE);
    int *maxCount=(int*)malloc(sizeof(int)*(NBANK+1));
    memset(maxCount, 0, sizeof(int)*(NBANK+1));
    for (i=0; i<NSAMPLE; ++i) {
        // generate a sample warp shared memory access
        for(j=0; j<WARPSIZE; j++){
            randomNumber[j]=rand()%NBANK;
        }
        // check the bank conflict
        memset(bank, 0, sizeof(int)*NBANK);
        int max_bank_conflict=0;
        for(j=0; j<WARPSIZE; j++){
            bank[randomNumber[j]]++;
        }
        for(j=0; j<WARPSIZE; j++)
            max_bank_conflict = std::max<int>(max_bank_conflict, bank[j]);
        // store statistic
        maxCount[max_bank_conflict]++;
    }
    // report statistic
    printf("Max conflict degree %% (%d random samples)\n", NSAMPLE);
    float expected = 0;
    for(i=1; i<NBANK+1; i++) {
        float prob = maxCount[i]/(float)NSAMPLE;
        printf("%02d -> %6.4f\n", i, prob);
        expected += prob * i;
    }
    printf("Expected maximum bank conflict degree = %6.4f\n", expected);
    return 0;
}
Using the percentages found in the program as probabilities, the expected maximum value is the sum of the products sum(i * probability(i)) for i from 1 to 32. I compute the expected value to be 3.529 (which matches ahmad's data). That's not far off, but 2.788 is supposed to be an upper bound. Since the upper bound is given in big-O notation, I guess a constant factor was left out. But that's currently as far as I've gotten.
Open questions: Is that constant factor enough to explain it? Is it possible to compute the constant factor for n = 32? It would be interesting to reconcile these, and/or to find a closed form solution for the expected maximum bank conflict degree with 32 banks and 32 parallel threads.
This is a very useful topic, since it can help in modeling and predicting performance when shared memory addressing is effectively random.

I assume Fermi's 32-bank shared memory, where each 4 consecutive bytes are stored in consecutive banks. Using the following code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NBANK 32
#define N 7823
#define WARPSIZE 32
#define NSAMPLE 10000

int main(){
    srand ( time(NULL) );
    int i=0, j=0;
    int *conflictCheck=NULL;
    int *randomNumber=NULL;
    int *statisticCheck=NULL;
    conflictCheck=(int*)malloc(sizeof(int)*NBANK);
    randomNumber=(int*)malloc(sizeof(int)*WARPSIZE);
    statisticCheck=(int*)malloc(sizeof(int)*(NBANK+1));
    memset(statisticCheck, 0, sizeof(int)*(NBANK+1)); /* malloc does not zero-initialize */
    while(i<NSAMPLE){
        // generate a sample warp shared memory access
        for(j=0; j<WARPSIZE; j++){
            randomNumber[j]=(rand()%N)%NBANK; /* random element index, mapped to its bank */
        }
        // check the bank conflict
        memset(conflictCheck, 0, sizeof(int)*NBANK);
        int max_bank_conflict=0;
        for(j=0; j<WARPSIZE; j++){
            conflictCheck[randomNumber[j]]++;
            max_bank_conflict = max_bank_conflict<conflictCheck[randomNumber[j]]? conflictCheck[randomNumber[j]]: max_bank_conflict;
        }
        // store statistic
        statisticCheck[max_bank_conflict]++;
        // next iter
        i++;
    }
    // report statistic
    printf("Over %d random shared memory accesses, the following percentages of bank conflicts were found\n", NSAMPLE);
    for(i=0; i<NBANK+1; i++){
        printf("%d -> %6.4f\n", i, statisticCheck[i]/(float)NSAMPLE);
    }
    return 0;
}
I got the following output:
Over 10000 random shared memory accesses, the following percentages of bank conflicts were found
0 -> 0.0000
1 -> 0.0000
2 -> 0.0281
3 -> 0.5205
4 -> 0.3605
5 -> 0.0780
6 -> 0.0106
7 -> 0.0022
8 -> 0.0001
9 -> 0.0000
10 -> 0.0000
11 -> 0.0000
12 -> 0.0000
13 -> 0.0000
14 -> 0.0000
15 -> 0.0000
16 -> 0.0000
17 -> 0.0000
18 -> 0.0000
19 -> 0.0000
20 -> 0.0000
21 -> 0.0000
22 -> 0.0000
23 -> 0.0000
24 -> 0.0000
25 -> 0.0000
26 -> 0.0000
27 -> 0.0000
28 -> 0.0000
29 -> 0.0000
30 -> 0.0000
31 -> 0.0000
32 -> 0.0000
We can conclude that a 3- or 4-way conflict is the most likely outcome with random access. You can tune the run with different N (number of elements in the array), NBANK (number of banks in shared memory), WARPSIZE (warp size of the machine), and NSAMPLE (number of random shared memory accesses generated to evaluate the model).


Finding the location of ones in a bit mask - Julia

I have a series of values that are each being stored as UInt16. Each of these numbers represents a bitmask - these numbers are commands that have been sent to a microprocessor telling it which pins to set high or low. I would like to parse this array of commands to find out which pins were being set high each time, in a way that is easier to analyse later.
Consider the example value 0x3c00, which in decimal is 15360 and in binary is 0011110000000000. Currently I have the following function
function read_message(hex_rep)
    return findall.(x -> x .== '1', bitstring(hex_rep))
end
which gets called on every element of the array of UInt16. Is there a better/more efficient way of doing this?
The best approach probably depends on how you want to handle vectors of hex-values. But here's an approach for processing a single hex which is much faster than the one in the OP:
function readmsg(x::UInt16)
    N = count_ones(x)
    inds = Vector{Int}(undef, N)
    if N == 0
        return inds
    end
    k = trailing_zeros(x)
    x >>= k + 1
    i = N - 1
    inds[N] = n = 16 - k
    while i >= 1
        (x, r) = divrem(x, 0x2)
        n -= 1
        if r == 1
            inds[i] = n
            i -= 1
        end
    end
    return inds
end
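For the example value in the question, readmsg(0x3c00) returns [3, 4, 5, 6] - the same indices that findall picks out of the bitstring.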
I can suggest padding your vector to a Vector{UInt64} and using that to manually construct a BitVector. The following should mostly work (even for input element types other than UInt16), but I haven't taken into account any specific endianness you might want to respect:
julia> function read_messages(msgs)
           bytes = reinterpret(UInt8, msgs)
           N = length(bytes)
           padded_bytes = zeros(UInt8, sizeof(UInt64) * cld(N, sizeof(UInt64)))
           copyto!(padded_bytes, bytes)
           b = BitVector(undef, N * 8)
           b.chunks = reinterpret(UInt64, padded_bytes)
           return b
       end
read_messages (generic function with 1 method)
julia> msgs
2-element Vector{UInt16}:
0x3c00
0x8000
julia> read_messages(msgs)
32-element BitVector:
0
0
0
0
0
0
0
0
0
⋮
0
0
0
0
0
0
0
1
julia> read_messages(msgs) |> findall
5-element Vector{Int64}:
11
12
13
14
32
julia> bitstring.(msgs)
2-element Vector{String}:
"0011110000000000"
"1000000000000000"
(Getting rid of the unnecessary allocation of the undef bit vector would require some black magic, I believe.)

Are these accesses coalesced?

note "When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions".
but I have some questions.
__global__ void add(double *a, double *b){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    i = 3 * i;
    b[i] = a[i] + a[i + 1] + a[i + 2];
}
Can the three accesses (a[i], a[i + 1], a[i + 2]) be executed with only one instruction? (I mean, is this coalesced access?)
Or does coalescing only exist across the different threads (transverse) of a warp, and not within a single thread?
I have read this similar question:
From non coalesced access to coalesced memory access CUDA
But I still don't understand, so is this non-coalesced memory access?
2.
__global__ void add(double *a, double *b){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    b[i] = a[i] + a[i + 10] + a[i + 12]; // assuming no out-of-bounds indexing
}
This may be non-coalesced access, so I changed the code to:
__global__ void add(double *a, double *b){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    __shared__ double shareM[3*BLOCK_SIZE];
    shareM[threadIdx.x] = a[i];
    shareM[threadIdx.x + 1] = a[i + 10];
    shareM[threadIdx.x + 2] = a[i + 12];
    b[i] = shareM[threadIdx.x] + shareM[threadIdx.x + 1] + shareM[threadIdx.x + 2];
}
I have read that coalesced access does not matter for shared memory,
but does that mean the access below is coalesced within one thread?
shareM[threadIdx.x] = a[i];
shareM[threadIdx.x + 1] = a[i + 10];
shareM[threadIdx.x + 2] = a[i + 12];
Or does shared memory coalesced access only exist across different threads, like the following example?
thread0:
shareM[0] = a[3]
thread1:
shareM[4] = a[23]
thread2:
shareM[7] = a[56]
3. I don't understand the claim that "coalesced access does not matter with shared memory".
Does it mean that loading data from global memory into local memory (or registers) is slower than loading it from global memory into shared memory?
If so, why don't we use shared memory as a transfer station (just 8 bytes of shared memory per thread would be enough)?
thank you.
Can the three accesses (a[i], a[i + 1], a[i + 2]) be executed with only one instruction? (I mean, is this coalesced access?)
When working with GPU kernels, it's better to think of everything in a parallel way. Every instruction is executed by a group of 32 threads, a.k.a. a warp, so these are actually not just three accesses (the word "access" is also vague here; I assume you mean array accesses) - they are 32 x 3 = 96 accesses in total. A more correct way to say this is that they are three array accesses per thread.
According to [1-3], the coalesced accessing pattern is a behavior in terms of a warp:
When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads.
So, we need to think respectively for these three array accesses. Let's rewrite the code as:
__global__ void add(double *a, double *b){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    i = 3 * i;
    double ai = a[i];      // <1>
    double ai1 = a[i + 1]; // <2>
    double ai2 = a[i + 2]; // <3>
    b[i] = ai + ai1 + ai2;
}
And it is sufficient to only consider the first warp, with thread IDs ranging from 0 to 31.
<1>: Each thread in the warp allocates a double variable called ai in its registers and wants to load a value from a based on the index i. Note the original i is in [0, 31] and is then multiplied by 3, so the warp is accessing a[0], a[3], ... , a[93]. Since a is a double array (i.e. every entry is 8 bytes), the warp needs to load 32 * 8 = 256 bytes in total, which could in principle be served by two 128-byte memory transactions. According to [4]:
If the size of the words accessed by each thread is more than 4 bytes, a memory request by a warp is first split into separate 128-byte memory requests that are issued independently: Two memory requests, one for each half-warp, if the size is 8 bytes, Four memory requests, one for each quarter-warp, if the size is 16 bytes.
to load these 256 bytes of data from global memory into registers, the minimum number of memory requests is 2. If a could be accessed in this way, the access pattern would be coalesced. But apparently the pattern used in <1> is not; it looks like the diagram below:
<1>
t0 + t31
+---+---+---+-------------+----------------------+
| | | | ...... |
v v v v v
+---+-------+----+--------+-------+--------+-----+--+-
|segment| | | | | |
+----------------+--------+-------+--------+--------+-
a[0] a[31] a[63] a[95]
32 threads in the warp are accessing memory separately in six 128-byte segments. In the cached mode, it needs six 128-byte memory transactions at least. That's 768 bytes in total, but only 256 bytes are useful. The bus utilization is about 1/3.
<2>: This is very similar to <1>, offset by 1 from the start:
<2>
t0 + t31
+---+---+---+-------------+----------------------+
| | | | ...... |
v v v v v
++---+---+---+---+--------+-------+--------+------+-+-
|segment| | | | | |
+----------------+--------+-------+--------+--------+-
a[0] a[31] a[63] a[95]
<3>: This is very similar to <1>, offset by 2 from the start:
<3>
t0 + t31
+---+---+---+-------------+----------------------+
| | | | ...... |
v v v v v
+-+---+---+---+--+--------+-------+--------+-------++-
|segment| | | | | |
+----------------+--------+-------+--------+--------+-
a[0] a[31] a[63] a[95]
I think by now you get the idea, and you are probably thinking: why not load all these 768 bytes from global memory in one pass, since every byte is used exactly once? However, recall that each thread has its private registers and these registers cannot communicate with each other ([5]), so this cannot be done with registers alone, and that's where shared memory comes in.
(warp1) (warp2) (warp3)
+ + +
| | |
t0 | t31 | t0 | t31
+-+-+-+---+-+-+-+---------+---------+-+-+-+++-+-+-+-+
| | | | | | | | | ...... | | | | | | | | |
v v v v v v v v v v v v v v v v v v
+-+-+-+---+-+-+-++--------+-------+-+-+-+-+++-+-+-+---
|segment| | | | | |
+----------------+--------+-------+--------+--------+-
a[0] a[31] a[63] a[95]
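To make the staging idea concrete, here is a minimal sketch (my own illustration, not from the original post; add_coalesced, tile, and BLOCK_SIZE are made-up names, and bounds checks are omitted). Each block loads the 3*BLOCK_SIZE consecutive doubles it needs with coalesced global reads, then each thread picks up its three values from fast shared memory:
__global__ void add_coalesced(const double *a, double *b)
{
    const int BLOCK_SIZE = 128;              // assumes the launch uses BLOCK_SIZE threads per block
    __shared__ double tile[3 * BLOCK_SIZE];

    int t    = threadIdx.x;
    int base = 3 * BLOCK_SIZE * blockIdx.x;  // first element this block touches

    // Three coalesced loads: consecutive threads read consecutive addresses.
    tile[t]                  = a[base + t];
    tile[t + BLOCK_SIZE]     = a[base + t + BLOCK_SIZE];
    tile[t + 2 * BLOCK_SIZE] = a[base + t + 2 * BLOCK_SIZE];
    __syncthreads();

    // Same arithmetic as the original kernel, but the strided reads now hit
    // shared memory instead of global memory.
    b[base + 3 * t] = tile[3 * t] + tile[3 * t + 1] + tile[3 * t + 2];
}
The store to b is still strided like in the original code; if b's layout is under your control, writing the sums contiguously would make the store coalesced as well.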
Does it mean that loading data from global memory into local memory (or registers) is slower than loading it from global memory into shared memory? If so, why don't we use shared memory as a transfer station (just 8 bytes of shared memory per thread would be enough)?
AFAICT, you cannot directly transfer data from global memory to shared memory.
References:
[1]. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#maximize-memory-throughput
[2]. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses
[3]. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-3-0__examples-of-global-memory-accesses
[4]. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#global-memory-3-0
[5]. I lied; there is a way to do this using the __shfl intrinsics.

Is this a CUDA thread synchronization issue or something else?

I am very new to parallel programming and Stack Overflow. I am working on a matrix multiplication implementation using CUDA. I am using column-order float arrays as matrix representations.
The algorithm I developed is a bit unique and goes as follows. Given an n x m matrix A and an m x k matrix B, I launch n x k blocks with m threads in each block. Essentially, I launch a block for every entry in the resulting matrix, with each thread computing one multiplication for that entry. For example,
1 0 0 0 1 2
0 1 0 * 3 4 5
0 0 1 6 7 8
For the first entry in the resulting matrix I would launch each thread with
thread 0 computing 1 * 3
thread 1 computing 0 * 0
thread 2 computing 0 * 1
With each thread adding to a 0-initialized matrix.
Right now, I am not getting a correct answer. I am getting this over and over again:
0 0 2
0 0 5
0 0 8
My kernel function is below. Could this be a thread synchronization problem or am I screwing up array indexing or something?
/*#param d_A: Column order matrix
*#param d_B: Column order matrix
*#param d_result: 0-initialized matrix that kernels write to
*#param dim_A: dimensionality of A (number of rows)
*#param dim_B: dimensionality of B (number of rows)
*/
__global__ void dot(float *d_A, float *d_B, float *d_result, int dim_A, int dim_B) {
    int n = blockIdx.x;
    int k = blockIdx.y;
    int m = threadIdx.x;
    float a = d_A[(m * dim_A) + n];
    float b = d_B[(k * dim_B) + m];
    //d_result[(k * dim_A) + n] += (a * b);
    syncthreads();
    float temp = d_result[(k*dim_A) + n];
    syncthreads();
    temp = temp + (a * b);
    syncthreads();
    d_result[(k*dim_A) + n] = temp;
    syncthreads();
}
The whole idea of using syncthreads() is wrong in this case. This API call has block scope.
syncthreads();
float temp = d_result[(k*dim_A) + n];
syncthreads();
temp = temp + (a * b);
syncthreads();
d_result[(k*dim_A) + n] = temp;
syncthreads();
The local variable float temp has thread scope, so using this synchronization barrier around it is senseless.
The pointer d_result is a global memory pointer, and using this synchronization barrier for it is also senseless. Note that there is no barrier available yet (and maybe there never will be) that synchronizes threads globally.
Typically, the usage of syncthreads() is required when shared memory is used for computation. In this case, you may want to use shared memory. Here you can see an example of how to use shared memory and syncthreads() properly. Here you have an example of matrix multiplication with shared memory.
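For illustration, here is a minimal sketch (my own, not from the linked examples) of the shared-memory-plus-__syncthreads() pattern applied to your one-block-per-entry layout. It assumes blockDim.x (i.e. m) is a power of two; otherwise the shared array has to be padded with zeros:
__global__ void dot(const float *d_A, const float *d_B, float *d_result,
                    int dim_A, int dim_B)
{
    extern __shared__ float partial[];  // blockDim.x floats, sized at launch
    int n = blockIdx.x;
    int k = blockIdx.y;
    int m = threadIdx.x;

    // Each thread writes its product into shared memory; __syncthreads() is
    // meaningful here because threads of the same block share partial[].
    partial[m] = d_A[(m * dim_A) + n] * d_B[(k * dim_B) + m];
    __syncthreads();

    // Tree reduction within the block (requires blockDim.x to be a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (m < stride)
            partial[m] += partial[m + stride];
        __syncthreads();
    }

    // One thread writes the finished dot product; no cross-thread race remains.
    if (m == 0)
        d_result[(k * dim_A) + n] = partial[0];
}
A launch would look like dot<<<dim3(n, k), m, m * sizeof(float)>>>(d_A, d_B, d_result, n, m).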

cudaMemset() - does it set bytes or integers?

From online documentation:
cudaError_t cudaMemset (void * devPtr, int value, size_t count )
Fills the first count bytes of the memory area pointed to by devPtr with the constant byte value value.
Parameters:
devPtr - Pointer to device memory
value - Value to set for each byte of specified memory
count - Size in bytes to set
This description doesn't appear to be correct as:
int *dJunk;
cudaMalloc((void**)&dJunk, 32*sizeof(int));
cudaMemset(dJunk, 0x12, 32);
will set all 32 integers to 0x12, not 0x12121212. (Int vs Byte)
The description talks about setting bytes. Count and Value are described in terms of bytes. Notice count is of type size_t, and value is of type int. i.e. Set a byte-size to an int-value.
cudaMemset() is not mentioned in the programming guide.
I have to assume the behavior I am seeing is correct, and the documentation is bad.
Is there a better documentation source out there? (Where?)
Are other types supported? i.e. Would float *dJunk; work? Others?
The documentation is correct, and your interpretation of what cudaMemset does is wrong. The function really does set byte values. Your example sets the first 32 bytes to 0x12, not all 32 integers to 0x12, viz:
#include <cstdio>

int main(void)
{
    const int n = 32;
    const size_t sz = size_t(n) * sizeof(int);
    int *dJunk;
    cudaMalloc((void**)&dJunk, sz);
    cudaMemset(dJunk, 0, sz);
    cudaMemset(dJunk, 0x12, 32);
    int *Junk = new int[n];
    cudaMemcpy(Junk, dJunk, sz, cudaMemcpyDeviceToHost);
    for(int i=0; i<n; i++) {
        fprintf(stdout, "%d %x\n", i, Junk[i]);
    }
    cudaDeviceReset();
    return 0;
}
produces
$ nvcc memset.cu
$ ./a.out
0 12121212
1 12121212
2 12121212
3 12121212
4 12121212
5 12121212
6 12121212
7 12121212
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 0
31 0
i.e. all 128 bytes are set to 0, then the first 32 bytes are set to 0x12. Exactly as described by the documentation.
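As for other types: the runtime API has no memset for patterns wider than a byte, but the driver API offers cuMemsetD32() for 32-bit fills, and you can always write a trivial kernel yourself. A hypothetical sketch (fill_int is my own name, not a CUDA API):
__global__ void fill_int(int *p, int value, size_t n)
{
    // Fill an int array with an arbitrary 32-bit value, since cudaMemset
    // can only replicate a single byte.
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        p[i] = value;
}
// usage: fill_int<<<(n + 255) / 256, 256>>>(dJunk, 0x12, n);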

Binary divisibility by 10

How can I check whether a binary number is divisible by 10 (decimal), without converting it to another number system?
For example, we have a number:
1010 1011 0100 0001 0000 0100
How can we check that this number is divisible by 10?
First split the number into odd and even bits (I'm calling "even" the bits corresponding to even powers of 2). Taking 100100110010110000000101101110 as an example:
 0 1 0 1 0 0 1 0 0 0 1 1 0 1 0   (even bits)
1 0 0 1 0 1 1 0 0 0 0 0 1 1 1    (odd bits)
Now in each of these, add and subtract the digits alternately, as in the standard test for divisibility by 11 in decimal (starting with addition at the right):
even: +0-1+0-1+0-0+1-0+0-0+1-1+0-1+0 = -2
odd:  +1-0+0-1+0-1+1-0+0-0+0-0+1-1+1 = +1
Now double the sum of the odd digits and add it to the sum of the even digits:
2*(+1) + (-2) = 0
If the result is divisible by 5, as in this case, the number itself is divisible by 5. (This works because, mod 5, the even powers of 2 alternate between +1 and -1, while the odd powers of 2 alternate between +2 and -2.)
Since this number is also divisible by 2 (the rightmost digit being 0), it is divisible by 10.
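Here is a small C sketch of that test (my own illustration, not part of the answer above). The pairing works because the even bit at position p and the odd bit at position p+1 carry the same sign mod 5:
#include <stdio.h>

static int divisible_by_10(unsigned int n)
{
    if (n & 1) return 0;  /* must be even to be divisible by 10 */
    int sum = 0, sign = 1;
    for (int pos = 0; pos < 32; pos += 2, sign = -sign) {
        sum += sign * (int)((n >> pos) & 1u);            /* even bit: +/-1 */
        sum += sign * 2 * (int)((n >> (pos + 1)) & 1u);  /* odd bit:  +/-2 */
    }
    return sum % 5 == 0;
}

int main(void)
{
    /* the number from the question: 1010 1011 0100 0001 0000 0100 */
    unsigned int n = 0xAB4104u;
    printf("%u is%s divisible by 10\n", n, divisible_by_10(n) ? "" : " not");
    return 0;
}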
If you are talking about computational methods, you can do a divisibility-by-5 test and a divisibility-by-2 test.
The numbers below assume unsigned 32-bit arithmetic, but can easily be extended to larger numbers.
I'll provide some code first, followed by a more textual explanation:
unsigned int div5exact(unsigned int n)
{
    // returns n/5 as long as n is actually divisible by 5
    // (because 'n * (INV5 * 5)' == 'n * 1' mod 2^32)
#define INV5 0xcccccccd
    return n * INV5;
}

unsigned int divides5(unsigned int n)
{
    unsigned int q = div5exact(n);
    if (q <= 0x33333333) /* q*5 < 2^32? */
    {
        /* q*5 doesn't overflow, so n == q*5 */
        return 1;
    }
    else
    {
        /* q*5 overflows, so n != q*5 */
        return 0;
    }
}
int divides2(unsigned int n)
{
    /* easy divisibility by 2 test */
    return (n & 1) == 0;
}

int divides10(unsigned int n)
{
    return divides2(n) && divides5(n);
}
/* fast one-liner: */
#define DIVIDES10(n) ( ((n) & 1) == 0 && ((n) * 0xcccccccd) <= 0x33333333 )
Divisibility by 2 is easy: (n&1) == 0 means that n is even.
Divisibility by 5 involves multiplying by the inverse of 5, which is 0xcccccccd (because 0xcccccccd * 5 == 0x400000001, which is just 0x1 if you truncate to 32 bits).
When you multiply n*5 by the inverse of 5, you get n * 5 * (inverse of 5), which in 32-bit math simplifies to n * 1.
Now let's say n and q are 32-bit numbers, and q = n * (inverse of 5) mod 2^32.
Because n is no greater than 0xffffffff, we know that n/5 is no greater than (2^32-1)/5 (which is 0x33333333). Therefore, if q is less than or equal to (2^32-1)/5, we know n is exactly divisible by 5, because q*5 doesn't get truncated in 32 bits and is therefore equal to n.
If q is greater than (2^32-1)/5, then we know n is not divisible by 5, because multiplication by the inverse is a one-to-one mapping between the 32-bit multiples of 5 and the numbers between 0 and (2^32-1)/5, so any q outside this range cannot come from a multiple of 5.
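A quick worked check of this argument (my own numbers, all arithmetic mod 2^32):
$$35 \cdot \texttt{0xcccccccd} \bmod 2^{32} = 7 \le \texttt{0x33333333}, \qquad 36 \cdot \texttt{0xcccccccd} \bmod 2^{32} = \texttt{0xccccccd4} > \texttt{0x33333333},$$
so 35 passes the test (and 7 is indeed 35/5), while 36 fails it.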
Here is Python code to check divisibility by 10 by scanning the bits and keeping a running remainder:
# taking input as a string representing a binary number, e.g. 1010, 1110
s = input()
# x holds the running remainder mod 10, starting at 0
x = 0
for i in s:
    if i == '1':
        x = (x*2 + 1) % 10
    else:
        x = x*2 % 10
# if x ends up as 0, the number is divisible by 10
if x:
    print("Not divisible by 10")
else:
    print("Divisible by 10")