In binary we can have signed and unsigned numbers. Say we are given the value 0101: how can we tell whether it is equal to 5 or to -1? As you may notice, the second bit from the left is on.
There is no difference in binary. The difference is in how a given language / compiler / environment / processor treats a given sequence of binary digits. For example, in the Intel x86/x64 world you have the MUL and IMUL instructions for multiplication. The IMUL instruction performs signed multiplication (i.e. treats the operand bits as a signed value). There are also other instructions that distinguish between signed/unsigned operands (e.g. DIV/IDIV, MOVSX, etc.).
Here's a quick example:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
int main(void)
{
int16_t c16;
uint16_t u16;
__asm {
mov al, 0x01
mov bl, 0x8F
mul bl // unsigned: ax = 0x01 * 0x8F = 0x008F (143)
mov u16, ax
mov al, 0x01
mov bl, 0x8F
imul bl // signed: 0x8F is read as -113, so ax = 0xFF8F (-113)
mov c16, ax
};
char uBits[65];
char cBits[65];
printf("%u:\t%s\n", u16, _itoa(u16, uBits, 2));
printf("%d:\t%s\n", c16, _itoa(c16, cBits, 2));
return 0;
}
Output is:
143: 10001111
-113: 11111111111111111111111110001111
On edit:
Just to expand on the example: in C/C++ (as in other languages that distinguish between signed and unsigned quantities), the compiler knows whether it is operating on signed or unsigned values and generates the appropriate instructions. In the above example, the compiler also knows it must sign-extend the variable c16 when calling _itoa(), because it converts it to an int (in C/C++, int is signed by default; it is equivalent to saying signed int). The variable u16 is also converted to an int in the call to _itoa(), but because its type is unsigned it is zero-extended rather than sign-extended (there is no such thing as a sign bit in an unsigned value).
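The same point can be sketched in portable C without inline assembly: one 8-bit pattern, two interpretations. The helper names as_unsigned8 and as_signed8 are made up for this illustration and are not part of the example above.

```c
#include <stdint.h>

/* Interpret the 8 bits as an unsigned value: 0..255. */
int as_unsigned8(uint8_t bits) {
    return bits;
}

/* Interpret the same 8 bits as a two's complement value: -128..127.
   Written arithmetically so it does not depend on how the compiler
   handles narrowing conversions. */
int as_signed8(uint8_t bits) {
    return bits >= 0x80 ? (int)bits - 256 : (int)bits;
}
```

For the pattern 0x8F, as_unsigned8 gives 143 and as_signed8 gives -113, the same two values the MUL/IMUL example prints. For 0x05 both give 5, because the top bit is clear.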
On actual hardware, the representation of negative numbers depends on what the designers chose. Signed numbers are usually represented in two's complement, but there are other representations, such as ones' complement and sign-magnitude.
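As a rough sketch of how the choice of representation changes the value, the functions below decode the same n-bit pattern under two's complement, ones' complement and sign-magnitude. The function names and the width limit (2 to 31 bits) are assumptions made for this illustration.

```c
#include <stdint.h>

/* In all three functions, `bits` holds the pattern in its low n bits,
   with 2 <= n <= 31; higher bits are ignored. */

/* Two's complement: the top bit has weight -2^(n-1). */
int decode_twos(uint32_t bits, int n) {
    uint32_t sign = 1u << (n - 1);
    int low = (int)(bits & (sign - 1));
    return (bits & sign) ? low - (int)sign : low;
}

/* Ones' complement: a negative value is the bitwise inverse of its magnitude. */
int decode_ones(uint32_t bits, int n) {
    uint32_t sign = 1u << (n - 1);
    uint32_t mask = (sign << 1) - 1;
    bits &= mask;
    return (bits & sign) ? -(int)(~bits & mask) : (int)bits;
}

/* Sign-magnitude: the top bit is the sign, the rest is the magnitude. */
int decode_sign_magnitude(uint32_t bits, int n) {
    uint32_t sign = 1u << (n - 1);
    int magnitude = (int)(bits & (sign - 1));
    return (bits & sign) ? -magnitude : magnitude;
}
```

With n = 4, the pattern 0101 decodes to 5 under all three schemes, since its top bit is clear. The pattern 1011 decodes to -5, -4 and -3 respectively.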
I would like to know if a binary exponent can be stored in floating point form. Here is an example of what I mean:
In a system floating point numbers use a 10-bit two's complement mantissa and a 6-bit floating point exponent
Convert 0101001000 000100 into denary:
Well if I assume that the exponent is in normal binary, the exponent equals 4
So the decimal point in the mantissa goes here initially:
0.101001000
Then we move the decimal point 4 places to the right, yielding
01010.01
Which equals 10.25 in denary.
This answer would be wildly different if the exponent could itself be stored in floating point form, as a decimal number can. I am asking whether the exponent can be stored in this way.
if a binary exponent can be stored in floating point form
Yes.
To form the denary from a string, use strtol().
To convert the denary into a floating-point value, extract its bits into a "mantissa" and an exponent. Form the FP value with ldexp().
double ldexp(double x, int exp);
The ldexp functions multiply a floating-point number by an integral power of 2.
C11dr §7.12.6.7 2
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
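// Assumed definition: the original omits it. A signed 16-bit word matches the output below.
typedef int16_t denary;
#define denary_MANTISSA_EXPO 9
#define denary_MANTISSA_MASK 0xFFC0u
#define denary_EXPO_SCALE 64
double denary_to_double(denary d) {
// Low 6 bits: the exponent, read as an unsigned value.
int expo = d & (denary_EXPO_SCALE - 1);
// High 10 bits: the two's complement mantissa (the sign comes from denary being signed).
int mantissa = (d - expo) / denary_EXPO_SCALE;
// The binary point sits 9 places from the right of the mantissa.
return ldexp(mantissa, expo - denary_MANTISSA_EXPO);
}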
void denary_test(const char *s) {
denary d = (denary) strtol(s, NULL, 2);
printf("0x%04X -->", d & 0xFFFF);
printf(" %+.9f\n", denary_to_double(d));
}
int main(void) {
denary_test("0101001000" "000100");
denary_test("0000000000" "000000"); // zero
denary_test("0000000001" "000000"); // denary_POS_MIN
denary_test("1111111111" "000000"); // denary_NEG_MIN
denary_test("0111111111" "111111"); // denary_POS_MAX
denary_test("1000000000" "111111"); // denary_NEG_MAX
}
Output
0x5204 --> +10.250000000
0x0000 --> +0.000000000
0x0040 --> +0.001953125
0xFFC0 --> -0.001953125
0x7FFF --> +9205357638345293824.000000000
0x803F --> -9223372036854775808.000000000
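The code above reads the 6-bit exponent as an unsigned value. If the format instead stores the exponent in two's complement (an assumption; the question does not say), only the exponent decoding changes. The sketch below handles both readings and scales by repeated multiplication or division by 2 so the arithmetic is easy to follow. The names sign_extend and decode_word are made up for this illustration.

```c
#include <stdint.h>

/* Sign-extend the low `width` bits of `field` (two's complement). */
int sign_extend(unsigned field, int width) {
    unsigned sign = 1u << (width - 1);
    field &= (sign << 1) - 1;
    return (field & sign) ? (int)field - (int)(sign << 1) : (int)field;
}

/* word: 10-bit two's complement mantissa in bits 15..6,
         6-bit exponent in bits 5..0.
   signed_exponent: nonzero to read the exponent as two's complement
   (hypothetical variant), zero to read it as unsigned. */
double decode_word(unsigned word, int signed_exponent) {
    int mantissa = sign_extend(word >> 6, 10);
    int expo = signed_exponent ? sign_extend(word, 6) : (int)(word & 0x3F);
    int shift = expo - 9;          /* the mantissa is a fraction: mantissa / 2^9 */
    double value = mantissa;
    for (; shift > 0; --shift) value *= 2.0;
    for (; shift < 0; ++shift) value /= 2.0;
    return value;
}
```

For the question's word 0101001000 000100 (0x5204), the mantissa is 328 and the exponent is 4 under either reading, so the value is 328 / 2^5 = 10.25. The two readings only diverge when the top exponent bit is set: 0101001000 111111 has exponent 63 when unsigned and -1 when signed.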
CUDA has popcount intrinsics for 32-bit and 64-bit types: __popc() and __popcll().
Does CUDA also have intrinsics to get the parity of 32-bit and 64-bit types?
(The parity refers to whether an integer has an even or odd amount of 1-bits.)
For example, GCC has __builtin_parityl() for 64-bit integers.
And here's a C function that does the same thing:
inline uint parity64(uint64 n){
n ^= n >> 1;
n ^= n >> 2;
n = (n & 0x1111111111111111lu) * 0x1111111111111111lu;
return (n >> 60) & 1;
}
I'm not aware of a parity intrinsic for CUDA.
However you should be able to create a fairly simple function to do it using either the __popc() (32-bit unsigned case) or __popcll() (64-bit unsigned case) intrinsics.
For example, the following function should indicate whether the number of 1 bits in a 64-bit unsigned quantity is odd (true) or even (false):
__device__ bool my_parity(unsigned long long d){
return (__popcll(d) & 1);
}
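For checking such a device function against a host-side reference, a plain C sketch may be handy. It shows two equivalent ways to get the parity: XOR folding, and counting the set bits and keeping the low bit of the count (which is what `__popcll(d) & 1` computes). The function names are made up for this illustration.

```c
#include <stdint.h>

/* Parity by XOR folding: each step XORs the upper half of the remaining
   bits onto the lower half, so bit 0 ends up as the XOR of all 64 bits. */
int parity64_fold(uint64_t n) {
    n ^= n >> 32;
    n ^= n >> 16;
    n ^= n >> 8;
    n ^= n >> 4;
    n ^= n >> 2;
    n ^= n >> 1;
    return (int)(n & 1);
}

/* Parity by counting: clear the lowest set bit until none remain,
   then keep the low bit of the count. */
int parity64_count(uint64_t n) {
    int count = 0;
    while (n) {
        n &= n - 1;
        ++count;
    }
    return count & 1;
}
```

Both return 1 for an odd number of set bits and 0 for an even number, so they agree with my_parity above for every input.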
I have been given an exam question that I have partly solved but do not completely understand.
Why is volatile used here? As for the missing expression,
I think it must be switches >> 8.
I also have some difficulty with the translation.
Eight switches are memory mapped to the memory address 0xabab0020, where the
least significant bit (index 0) represents switch number 1 and the bit with index 7
represents switch number 8. A bit value 1 indicates that the switch is on and
0 means that it is off. Write down the missing C code expression, such that the
while loop exits if the toggle switch number 8 is off.
volatile int * switches = (volatile int *) 0xabab0020;
volatile int * leds = (volatile int *) 0xabab0040;
while(/* MISSING C CODE EXPRESSION */){
*leds = (*switches >> 4) & 1;
}
Translate the complete C code above into MIPS assembly code, including the missing C code expression. You are not allowed to use pseudo instructions.
Without volatile, your code can legally be interpreted by the compiler as:
int * switches = (int *) 0xabab0020;
int * leds = (int *) 0xabab0040;
*leds = (*switches >> 4) & 1;
while(/* MISSING C CODE EXPRESSION */){
}
The volatile qualifier is an indication to the C compiler that the data at addresses switches and leds can be changed by another agent in the system. Without the volatile qualifier, the compiler would be allowed to optimize references to these variables away.
The problem description says the loop should run while bit 7 of *switches is set, i.e.: while ((*switches & 0x80) != 0). The inner parentheses are required, because != binds more tightly than & in C.
Translating the code is left as an exercise for the reader.
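The loop condition and the loop body can be tried out on a host machine by pointing the same expressions at an ordinary variable instead of the memory-mapped address. This is only a sketch: the helper names are made up, and a plain volatile int stands in for the hardware register.

```c
/* Nonzero while switch number 8 (bit index 7) is on.
   The parentheses around the mask matter: without them,
   `*switches & 0x80 != 0` parses as `*switches & (0x80 != 0)`,
   which tests bit 0 instead of bit 7. */
int switch8_is_on(const volatile int *switches) {
    return (*switches & 0x80) != 0;
}

/* The value the loop body writes to the LEDs: bit index 4 of the switches. */
int led_value(const volatile int *switches) {
    return (*switches >> 4) & 1;
}
```

With a test variable holding 0x80 the condition is true; with 0x7F or 0x01 it is false, so the loop would exit. The shift form (*switches >> 7) & 1 is an equivalent way to write the condition.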
volatile int * switches = (volatile int *) 0xabab0020;
volatile int * leds = (volatile int *) 0xabab0040;
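while((*switches >> 7) & 1){
*leds = (*switches >> 4) & 1;
}
To MIPS:
lui $s0,0xabab #load the upper half of the switches address
ori $s0,$s0,0x0020 #$s0 = 0xabab0020
lui $s1,0xabab
ori $s1,$s1,0x0040 #$s1 = 0xabab0040 (leds)
while:
lw $t0,0($s0) #read the switches
srl $t0,$t0,7 #move bit 7 (switch number 8) down to bit 0
andi $t0,$t0,1 #clear the other bits, keep the LSB
beq $t0,$0,done #exit the loop if switch 8 is off
lw $t1,0($s0) #read the switches again (volatile)
srl $t1,$t1,4
andi $t1,$t1,1
sw $t1,0($s1) #write bit 4 to the leds
j while
done:
#if the target has branch delay slots, fill the slot after beq and j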
My CUDA program needs to do a reduction on double-precision data. I am using Julien Demouth's slides titled "Shuffle: Tips and Tricks";
the shuffle function is below:
/*for shuffle of double-precision point */
__device__ __inline__ double shfl(double x, int lane)
{
int warpSize = 32;
// Split the double number into 2 32b registers.
int lo, hi;
asm volatile("mov.b32 {%0,%1}, %2;":"=r"(lo),"=r"(hi):"d"(x));
// Shuffle the two 32b registers.
lo = __shfl_xor(lo,lane,warpSize);
hi = __shfl_xor(hi,lane,warpSize);
// Recreate the 64b number.
asm volatile("mov.b64 %0,{%1,%2};":"=d"(x):"r"(lo),"r"(hi));
return x;
}
At present, I get the errors below when compiling the program:
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 71; error : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 271; error : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 287; error : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 302; error : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 317; error : Arguments mismatch for instruction 'mov'
ptxas /tmp/tmpxft_00002cfb_00000000-5_csr_double.ptx, line 332; error : Arguments mismatch for instruction 'mov'
ptxas fatal : Ptx assembly aborted due to errors
make: *** [csr_double] error 255
Could someone give some advice?
There is a syntax error in the inline assembly instruction for the load of the double argument to 32 bit registers. This:
asm volatile("mov.b32 {%0,%1}, %2;":"=r"(lo),"=r"(hi):"d"(x));
should be:
asm volatile("mov.b64 {%0,%1}, %2;":"=r"(lo),"=r"(hi):"d"(x));
Using a "d" operand (i.e. a 64-bit floating point register) as the source of a 32-bit move is illegal, and a mov.b32 makes no sense here: the code must unpack 64 bits into two 32-bit registers.
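What the pair of mov.b64 instructions does can be sketched in host-side C: split the 64 bits of a double into a low and a high 32-bit word, then put them back together. This is only an illustration of the data movement, not CUDA code. The function names are made up, and the sketch assumes double is 64 bits wide.

```c
#include <stdint.h>
#include <string.h>

/* Split the bit pattern of a double into two 32-bit halves,
   like `mov.b64 {lo,hi}, x` in PTX. */
void split_double(double x, uint32_t *lo, uint32_t *hi) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);   /* copy the bits, no numeric conversion */
    *lo = (uint32_t)bits;
    *hi = (uint32_t)(bits >> 32);
}

/* Rebuild the double from its halves, like `mov.b64 x, {lo,hi}`. */
double join_double(uint32_t lo, uint32_t hi) {
    uint64_t bits = ((uint64_t)hi << 32) | lo;
    double x;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Splitting and rejoining returns the original value bit for bit, which is why shuffling the two halves separately is enough to move a double between lanes.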
As of CUDA 9.0, __shfl, __shfl_up, __shfl_down and __shfl_xor have been deprecated.
The newly introduced functions __shfl_sync, __shfl_up_sync, __shfl_down_sync and __shfl_xor_sync have the following prototypes:
T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize);
T __shfl_up_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_down_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_xor_sync(unsigned mask, T var, int laneMask, int width=warpSize);
where T can be int, unsigned int, long, unsigned long, long long, unsigned long long, float or double.
You no longer need to write your own shuffle functions for double-precision arithmetic.
I am working on a GPU algorithm that is supposed to do a lot of modular computations, in particular various operations on matrices in a finite field, which in the long run
reduce to primitive operations like: (a*b - c*d) mod m or (a*b + c) mod m where a,b,c and d are residues modulo m and m is a 32-bit prime.
Through experimentation I learned that the performance of the algorithm is mostly limited by slow modular arithmetic because integer modulo (%) and division operations are not supported on the GPU in hardware.
I would appreciate it if somebody could give me an idea of how to implement efficient modular computations with CUDA.
To see how this is implemented on CUDA, I use the following code snippet:
__global__ void mod_kernel(unsigned *gout, const unsigned *gin) {
unsigned tid = threadIdx.x;
unsigned a = gin[tid], b = gin[tid * 2], m = gin[tid * 3];
typedef unsigned long long u64;
__syncthreads();
unsigned r = (unsigned)(((u64)a * (u64)b) % m);
__syncthreads();
gout[tid] = r;
}
This code is not supposed to work, I just wanted to see how modular reduction is
implemented on CUDA.
When I disassemble this with cuobjdump --dump-sass (thanks njuffa for advice!), I see the following:
/*0098*/ /*0xffffdc0450ee0000*/ BAR.RED.POPC RZ, RZ;
/*00a0*/ /*0x1c315c4350000000*/ IMUL.U32.U32.HI R5, R3, R7;
/*00a8*/ /*0x1c311c0350000000*/ IMUL.U32.U32 R4, R3, R7;
/*00b0*/ /*0xfc01dde428000000*/ MOV R7, RZ;
/*00b8*/ /*0xe001000750000000*/ CAL 0xf8;
/*00c0*/ /*0x00000007d0000000*/ BPT.DRAIN 0x0;
/*00c8*/ /*0xffffdc0450ee0000*/ BAR.RED.POPC RZ, RZ;
Note that between the two BAR.RED.POPC instructions there is a call to the procedure at 0xf8, which implements some sophisticated algorithm (about 50 instructions or even more). It is not surprising that the mod (%) operation is slow.
Some time ago I experimented a lot with modular arithmetic on the GPU. On Fermi GPUs you can use double-precision arithmetic to avoid expensive div and mod operations. For example, modular multiplication can be done as follows:
// fast truncation of double-precision to integers
#define CUMP_D2I_TRUNC (double)(3ll << 51)
// computes r = a + b subop c unsigned using extended precision
#define VADDx(r, a, b, c, subop) \
asm volatile("vadd.u32.u32.u32." subop " %0, %1, %2, %3;" : \
"=r"(r) : "r"(a) , "r"(b), "r"(c));
// computes a * b mod m; invk = (double)(1<<30) / m
__device__ __forceinline__
unsigned mul_m(unsigned a, unsigned b, volatile unsigned m,
volatile double invk) {
unsigned hi = __umulhi(a*2, b*2); // 3 flops
// 2 double instructions
double rf = __uint2double_rn(hi) * invk + CUMP_D2I_TRUNC;
unsigned r = (unsigned)__double2loint(rf);
r = a * b - r * m; // 2 flops
// can also be replaced by: VADDx(r, r, m, r, "min") // == umin(r, r + m);
if((int)r < 0)
r += m;
return r;
}
However, this only works for 31-bit integer moduli (if 1 bit is not critical for you),
and you also need to precompute 'invk' beforehand. This gives the absolute minimum number of instructions I can achieve, i.e.:
SHL.W R2, R4, 0x1;
SHL.W R8, R6, 0x1;
IMUL.U32.U32 R4, R4, R6;
IMUL.U32.U32.HI R8, R2, R8;
I2F.F64.U32 R8, R8;
DFMA R2, R2, R8, R10;
IMAD.U32.U32 R4, -R12, R2, R4;
ISETP.GE.AND P0, pt, R4, RZ, pt;
@!P0 IADD R4, R12, R4;
For description of the algorithm, you can have a look at my paper:
gpu_resultants. Other operations like (xy - zw) mod m are also explained there.
Out of curiosity, I compared the performance of the resultant algorithm
using your modular multiplication:
unsigned r = (unsigned)(((u64)a * (u64)b) % m);
against the optimized version with mul_m.
Modular arithmetic with default % operation:
low_deg: 11; high_deg: 2481; bits: 10227
nmods: 330; n_real_pts: 2482; npts: 2495
res time: 5755.357910 ms; mod_inv time: 0.907008 ms; interp time: 856.015015 ms; CRA time: 44.065857 ms
GPU time elapsed: 6659.405273 ms;
Modular arithmetic with mul_m:
low_deg: 11; high_deg: 2481; bits: 10227
nmods: 330; n_real_pts: 2482; npts: 2495
res time: 1100.124756 ms; mod_inv time: 0.192608 ms; interp time: 220.615143 ms; CRA time: 10.376352 ms
GPU time elapsed: 1334.742310 ms;
So on average it is about 5x faster. Note also that you might not see a speed-up if you just evaluate raw arithmetic performance using a kernel with a bunch of mul_mod operations (like the saxpy example). But in real applications with control logic, synchronization barriers etc., the speed-up is very noticeable.
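The general idea behind this kind of routine (estimate the quotient with a precomputed floating-point reciprocal, multiply back, then correct the remainder by at most one multiple of m) can be sketched in host-side C. This is not the mul_m code above and does not use its 31-bit trick or its instruction sequence: it is a simplified variant that assumes a, b < m < 2^32 and uses 64-bit integers for the exact product. The function name is made up for this illustration.

```c
#include <stdint.h>

/* (a * b) mod m for a, b < m < 2^32, using inv = 1.0 / m
   precomputed by the caller. */
uint32_t mul_mod_recip(uint32_t a, uint32_t b, uint32_t m, double inv) {
    /* Exact 64-bit product. */
    uint64_t prod = (uint64_t)a * b;

    /* Quotient estimate from double arithmetic. The true quotient is
       below 2^32, so rounding can move the estimate across an integer
       boundary by at most one. */
    uint64_t q = (uint64_t)((double)a * (double)b * inv);

    /* Remainder candidate: it lies in [-m, 2m), so it fits easily in a
       signed 64-bit value after the unsigned subtraction wraps. */
    int64_t r = (int64_t)(prod - q * m);

    /* Correct an estimate that was one too high or one too low. */
    if (r < 0) r += m;
    if (r >= (int64_t)m) r -= m;
    return (uint32_t)r;
}
```

A caller would compute inv = 1.0 / m once per modulus and reuse it for every multiplication, which is where the saving over a plain % comes from.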
A high-end Fermi GPU (e.g. a GTX 580) will likely give you the best performance among shipping cards for this. You would want all 32-bit operands to be of type "unsigned int" for best performance, as there is some additional overhead for the handling of signed divisions and modulos.
The compiler generates very efficient code for division and modulo with a fixed divisor. As I recall, it is usually around three to five machine instructions on Fermi and Kepler. You can check the generated SASS (machine code) with cuobjdump --dump-sass. You might be able to use templated functions with constant divisors if you only use a few different divisors.
You should see on the order of sixteen inlined SASS instructions being generated for the unsigned 32-bit operations with variable divisor, across Fermi and Kepler. The code is limited by the throughput of integer multiplies and for Fermi-class GPUs is competitive with hardware solutions. Somewhat reduced performance is seen on currently shipping Kepler-class GPUs due to their reduced integer multiply throughput.
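To give a feel for why a fixed divisor is cheap, here is the classic multiply-and-shift replacement for unsigned division by 3, written out by hand in C. It illustrates the kind of transformation compilers apply for constant divisors; it is not taken from the CUDA compiler's output, and the function names are made up.

```c
#include <stdint.h>

/* x / 3 for any 32-bit unsigned x.
   0xAAAAAAAB is (2^33 + 1) / 3, so the high bits of the 64-bit
   product, shifted right by 33, give the exact quotient. */
uint32_t div3(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABull) >> 33);
}

/* x % 3, derived from the quotient with one multiply and one subtract. */
uint32_t mod3(uint32_t x) {
    return x - 3u * div3(x);
}
```

One wide multiply, a shift, and a multiply-subtract replace the division entirely, which is consistent with the handful of instructions mentioned above.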
[Added later, after clarification of the question:]
Unsigned 64-bit division and modulo with a variable divisor, on the other hand, are called subroutines of about 65 instructions on Fermi and Kepler. They look close to optimal. On Fermi, this is still reasonably competitive with hardware implementations (note that 64-bit integer divisions are not exactly super fast on CPUs that provide this as a built-in instruction). Below is some code that I posted to the NVIDIA forums some time back for the kind of task described in the clarification. It avoids the expensive division, but does assume that fairly large batches of operands share the same divisor. It uses double-precision arithmetic, which is especially fast on Tesla-class GPUs (as opposed to consumer cards). I only did a cursory test of the code, so you might want to test it more carefully before deploying it.
// Let b, p, and A[i] be integers < 2^51
// Let N be an integer on the order of 10000
// for i from 1 to N
// A[i] <-- A[i] * b mod p
/*---- kernel arguments ----*/
unsigned long long *A;
double b, p; /* convert from unsigned long long to double before passing to kernel */
double oop; /* pass precomputed 1.0/p to kernel */
/*---- code inside kernel -----*/
double a, q, h, l, rem;
const double int_cvt_magic = 6755399441055744.0; /* 2^52+2^51 */
a = (double)A[i];
/* approximate quotient and round it to the nearest integer */
q = __fma_rn (a * b, oop, int_cvt_magic);
q = q - int_cvt_magic;
/* back-multiply, representing p*q as a double-double h:l exactly */
h = p * q;
l = __fma_rn (p, q, -h);
/* remainder is double-width product a*b minus double-double h:l */
rem = __fma_rn (a, b, -h);
rem = rem - l;
/* remainder may be negative as quotient rounded; fix if necessary */
if (rem < 0.0) rem += p;
A[i] = (unsigned long long)rem;
There are tricks to perform mod operations efficiently, but only if m is a power of 2.
For instance, for unsigned x, x mod y == x & (y-1), where y is 2^n. A bitwise operation is the fastest option.
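A minimal C sketch of the mask trick, with a helper to check that the modulus really is a power of two before relying on it. The function names are made up for this illustration.

```c
#include <stdint.h>

/* Nonzero if y is a power of two, i.e. has exactly one bit set. */
int is_pow2(uint32_t y) {
    return y != 0 && (y & (y - 1)) == 0;
}

/* x mod y for unsigned x, valid only when y is a power of two:
   y - 1 is then a mask of the low n bits, which are the remainder. */
uint32_t mod_pow2(uint32_t x, uint32_t y) {
    return x & (y - 1);
}
```

For example, 29 mod 8 becomes 29 & 7, which is 5. For a modulus such as 48 the mask is not a contiguous run of low bits, so the trick does not apply.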
Otherwise, probably a look-up table?
Below is a link on discussion of efficient modulo implementation. You might need to implement it yourself to get the most out of it.
Efficient computation of mod