Can an exponent be stored as a floating point number? - binary

I would like to know whether a binary exponent can be stored in floating point form. Here is an example of what I mean:
In a system, floating point numbers use a 10-bit two's complement mantissa and a 6-bit floating point exponent.
Convert 0101001000 000100 into denary:
If I assume that the exponent is a plain binary integer, the exponent equals 4.
So the binary point in the mantissa initially goes here:
0.101001000
Then we move the binary point 4 places to the right, yielding
01010.01
which equals 10.25 in denary.
The answer would be wildly different if the exponent could itself be stored with a fractional part. I am asking whether the exponent can be stored in this way.

if a binary exponent can be stored in floating point form
Yes.
To form the encoded 16-bit value (the denary type in the code below) from a string of binary digits, use strtol() with base 2.
To convert that value into a floating-point number, split its bits into the "mantissa" and the exponent, then form the FP value with ldexp().
double ldexp(double x, int exp);
The ldexp functions multiply a floating-point number by an integral power of 2.
C11dr §7.12.6.6 2
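As a quick check of the hand calculation in the question, here is a minimal sketch. The helper name scale_by_power_of_two is made up for illustration; it only wraps ldexp() and assumes the mantissa has already been read as a fraction in [-1, 1).

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper: value = fraction * 2^exponent, exponent being a plain integer. */
double scale_by_power_of_two(double fraction, int exponent) {
    assert(fraction >= -1.0 && fraction < 1.0); /* two's complement fraction range */
    return ldexp(fraction, exponent);
}
```

The mantissa 0.101001000 in binary is 1/2 + 1/8 + 1/64 = 0.640625, so scale_by_power_of_two(0.640625, 4) gives 10.25, the same value obtained by moving the binary point four places. Link with -lm where the math library is separate.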
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
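// Assumed type for the 16-bit encoding: 10-bit mantissa in the high bits, 6-bit exponent in the low bits
typedef int16_t denary;
#define denary_MANIISSA_EXPO 9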
#define denary_MANIISSA_MASK 0xFFC0u
#define denary_EXPO_SCALE 64
double denary_to_double(denary d) {
int expo = d & (denary_EXPO_SCALE - 1);
int mantissa = (d - expo) / denary_EXPO_SCALE;
return ldexp(mantissa, expo - denary_MANIISSA_EXPO);
}
void denary_test(const char *s) {
denary d = (denary) strtol(s, NULL, 2);
printf("0x%04X -->", d & 0xFFFF);
printf(" %+.9f\n", denary_to_double(d));
}
int main(void) {
denary_test("0101001000" "000100");
denary_test("0000000000" "000000"); // zero
denary_test("0000000001" "000000"); // denary_POS_MIN
denary_test("1111111111" "000000"); // denary_NEG_MIN
denary_test("0111111111" "111111"); // denary_POS_MAX
denary_test("1000000000" "111111"); // denary_NEG_MAX
}
Output
0x5204 --> +10.250000000
0x0000 --> +0.000000000
0x0040 --> +0.001953125
0xFFC0 --> -0.001953125
0x7FFF --> +9205357638345293824.000000000
0x803F --> -9223372036854775808.000000000
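The code above reads the 6-bit exponent as an unsigned integer. The question does not say whether the exponent is signed, so the following is only a sketch under the assumption that it is two's complement like the mantissa. The names sign_extend and decode_signed_exponent are invented for this example.

```c
#include <assert.h>
#include <math.h>

/* Sign-extend the low `width` bits of `field`, read as two's complement. */
static int sign_extend(unsigned field, unsigned width) {
    unsigned sign_bit = 1u << (width - 1);
    field &= (sign_bit << 1) - 1u;              /* keep only the low `width` bits */
    return (int)(field ^ sign_bit) - (int)sign_bit;
}

/* Decode 16 bits: 10-bit mantissa (high bits), 6-bit exponent (low bits).
   Assumption: BOTH fields are two's complement. */
double decode_signed_exponent(unsigned bits) {
    assert(bits <= 0xFFFFu);
    int mantissa = sign_extend(bits >> 6, 10);
    int exponent = sign_extend(bits, 6);
    /* The mantissa has 9 fraction bits, hence the -9. */
    return ldexp(mantissa, exponent - 9);
}
```

With this reading 0101001000 000100 (0x5204) is still 10.25, but an exponent field of 111111 now means -1 instead of 63, so 0100000000 111111 (0x403F) decodes to 0.5 * 2^-1 = 0.25.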

Related

Semantics of __ddiv_ru

From the documentation of __ddiv_ru I expected the following code to produce ceil(8/32) = 1.0; instead I obtain 0.25.
#include <iostream>
using namespace std;
__managed__ double x;
__managed__ double y;
__managed__ double r;
__global__ void ceilDiv()
{
r = __ddiv_ru(x,y);
}
int main()
{
x = 8;
y = 32;
r = -1;
ceilDiv<<<1,1>>>();
cudaDeviceSynchronize();
cout << "The ceil of " << x << "/" << y << " is " << r << endl;
return 1;
}
What am I missing?
The result you are obtaining is correct.
The intrinsic you are using implements double precision division with a specific IEEE 754-2008 rounding mode for the unit in the last place (ULP) of the significand. This controls what happens when a result cannot be exactly represented in the selected format. In this case you have selected round up, which means the last digit of the significand produced in the division result is rounded up (toward +∞). In your case all rounding modes should produce the same result because the result can be exactly represented in IEEE 754 binary64 format (it is a round power of 2).
Please read everything here before writing any more floating point code.

First year CS student trying to understand functions?

I'm a first-year CS student trying to understand functions, but I'm stuck on a problem where I have to use a function within another function. I have to create a program that checks all numbers from 0 to 100 and finds all the numbers that are evenly divisible by the divisor. I'm only allowed to have three functions, named getDivisor, findNumbers and calcSquare. The output is supposed to be each number that is found (from 0 to 100) and the square of that number. I wrote a program (shown below) that runs and asks the first question (what the divisor is), but it stays open for only a few seconds and then closes when trying to compute which numbers are divisible by the divisor. I'm not sure exactly what I did wrong, but I would like to know so I can learn from my mistake! Please disregard the style, it's very sloppy; I usually go back and clean it up after I finish the program.
#include <iostream>
#include <string>
#include <cmath>
#include <iomanip>
using namespace std;
int getDivisor();
void findNumbers(int divisor, int lower, int upper, double &lowerSquared);
double calcSquare(int lower);
int main()
{
int divisor;
int lower = 0;
int upper = 100;
double lowerSquared;
divisor = getDivisor();
cout << "Here are the numbers, from 0 to 100, that are evenly divisble by "
<< divisor << ", and their squares:\n";
findNumbers(divisor, lower, upper, lowerSquared);
system("pause");
return 0;
}
int getDivisor()
{
int divisor;
cout << "Enter a divisor: ";
cin >> divisor;
return divisor;
}
void findNumbers(int divisor, int lower, int upper, double &lowerSquared)
{
while (lower < upper)
{
if (((lower / divisor) % 2) == 0)
{
lowerSquared = calcSquare(lower);
cout << setprecision(0) << fixed << setw(4) << lower << setw(8)<< lowerSquared << endl;
lower++;
}
else
{
lower++;
}
}
}
double calcSquare(int lower)
{
double lowerSquared;
lowerSquared = pow(lower, 2);
return lowerSquared;
}
The output should look like this if the user enters 15: a list with each number on the left and its square to the right of it (I don't know how to format it properly on here, sorry):
Enter a divisor: 15
Here are the numbers, from 0 to 100, that are evenly divisble by 15, and their squares:
0 0
15 225
30 900
45 2025
60 3600
75 5625
90 8100
I appreciate any assistance!
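Are you getting any error? When I run your code I get an exception:
Floating point exception (core dumped)
Despite its name, this is the error reported for an integer division by zero. The only division in your if statement is lower / divisor, so it is triggered when divisor is 0 (for example when 0 is entered, or when reading from cin fails). Starting the count at 1 instead of 0 does not avoid it; check that the divisor is not 0 before entering the loop.
Also, check the logic in the if statement, because as it stands it won't give the result you want: (lower / divisor) % 2 == 0 tests whether the quotient is even, whereas divisibility is tested with lower % divisor == 0.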
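To make the difference between the two tests concrete, here is a small sketch in plain C (the thread's code is C++, but these two functions compile the same way there apart from the stdbool include). The function names are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* True when n is evenly divisible by divisor. A zero divisor divides nothing here. */
bool is_evenly_divisible(int n, int divisor) {
    if (divisor == 0) {
        return false;   /* n % 0 is undefined behaviour, so never evaluate it */
    }
    return n % divisor == 0;
}

/* The test from the question, kept for comparison: is the quotient even? */
bool quotient_is_even(int n, int divisor) {
    assert(divisor != 0);
    return (n / divisor) % 2 == 0;
}
```

With a divisor of 15, quotient_is_even(7, 15) is true because 7 / 15 is 0, while quotient_is_even(15, 15) is false because the quotient 1 is odd. is_evenly_divisible gives the expected answers: false for 7 and true for 15.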
/*Description:
This program is a homework assignment to practice what I
learned from lecture #7a. It illustrates how to use
functions properly, specifically how to use functions
within other functions. The user is prompted to input
a divisor, which is then passed to a function that finds
every number from 0 to 100 that is evenly divisible by it.*/
#include <iostream>
#include <string>
#include <cmath>
#include <iomanip>
using namespace std;
int getDivisor();
void findNumbers(int divisor, int lower, int upper, double &lowerSquared);
double calcSquare(int lower);
//====================== main ===========================
//
//=======================================================
int main()
{
int divisor;
int lower = 0;
int upper = 100;
double lowerSquared;
//Gets the divisor and assigns it to this variable.
divisor = getDivisor();
cout << "Here are the numbers, from 0 to 100, that are evenly divisble by "
<< divisor << ", and their squares:\n";
//Finds the numbers that are divisible by divisor,
//displays and shows their squares.
findNumbers(divisor, lower, upper, lowerSquared);
system("pause");
return 0;
}
/*===================== getDivisor ==========================
This function gets the divisor from the user so it can
assign it to the divisor variable to use in a later
function to check and see if it is divisible from 0-100.
Input:
Divisor
Output:
Divisor being assigned to divisor variable.*/
int getDivisor()
{
int divisor;
cout << "Enter a divisor: ";
cin >> divisor;
return divisor;
}
/*===================== findNumbers ==========================
This function runs a loop from 0 to 100 to check whether
each number is evenly divisible by the divisor the user
entered. It also displays the numbers that are evenly
divisible and their squares with the help of the
calcSquare function.
Input:
There is no user input, other than the divisor from
the getDivisor function.
Output:
Numbers between 0 and 100 that are divisible by the
divisor and their squares.*/
void findNumbers(int divisor, int lower, int upper, double &lowerSquared)
{
while (lower <= upper)
{
if (lower % divisor == 0)
{
lowerSquared = calcSquare(lower);
cout << setprecision(0) << fixed << setw(4) << lower << setw(8) <<
lowerSquared << endl;
lower++;
}
else
{
lower++;
}
}
}
/*===================== calcSquare ==========================
This function squares the number from 0 to 100 (whatever
number that might be in the loop) that is divisible by the
user entered divisor, so that it may assign it to the
lowersquared variable in the findNumbers function to be
used in the output.
Input:
Number from 0 to 100 that is divisible by user entered
divisor
Output:
Number from 0 to 100 squared.*/
double calcSquare(int lower)
{
double lowerSquared;
lowerSquared = pow(lower, 2);
return lowerSquared;
}
//==========================================================
/*OUTPUT:
Enter a divisor: 15
Here are the numbers, from 0 to 100, that are evenly divisble by 15, and their
squares:
0 0
15 225
30 900
45 2025
60 3600
75 5625
90 8100
Press any key to continue . . .
*/
//==========================================================

Finding the sum of a given number

I need help. For example, if the given number is 55, I want to find out how many 10's there are in it (in 55 there are five 10's), and the program also has to show the remainder (in 55, after the five 10's, the remainder is 5). How do I do this?
(it has to show something like this)
Enter amount:55
Number of 10's:5
Remainder:5
Use integer division and the modulus operator %, like
select floor(55/10) as quotient, 55%10 as remainder
from dual
To solve this you need the division and modulo operators, and a little help from the built-in floor function :)
What you need to do here is:
select floor(55/10) as [Number of tens], 55 % 10 as remainder
Look at the arithmetic operators, specifically division and modulo: https://dev.mysql.com/doc/refman/5.7/en/arithmetic-functions.html
#include <stdio.h>
int main(){
int dividend, divisor, quotient, remainder;
printf("Enter dividend: ");
scanf("%d", &dividend);
printf("Enter divisor: ");
scanf("%d", &divisor);
// Computes quotient
quotient = dividend / divisor;
// Computes remainder
remainder = dividend % divisor;
printf("Quotient = %d\n", quotient);
printf("Remainder = %d", remainder);
return 0;
}
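For the specific case in the question (how many 10's and what is left over), the standard library function div() returns both results at once. This is only a sketch; the wrapper name split_into_tens is made up for illustration and it assumes a non-negative amount.

```c
#include <assert.h>
#include <stdlib.h>

/* Split a non-negative amount into whole tens (quot) and the left-over part (rem). */
div_t split_into_tens(int amount) {
    assert(amount >= 0);
    return div(amount, 10);
}
```

For an amount of 55, the returned structure has quot equal to 5 and rem equal to 5, which are the two numbers to print after "Number of 10's:" and "Remainder:".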

Efficient bitstream convolution

I have two floating point time series A, B of length N each. I have to calculate the circular convolution and find maximum value. The classic and fastest way of doing this is
C = iFFT(FFT(A) * FFT(B))
Now, let's suppose that both A and B is a series which contains only 1s and 0s, so in principle we can represent them as bitstreams.
Question: Is there any faster way of doing the convolution (and find its maximum value) if I am somehow able to make use of the fact above ?
(I have already thought a lot about Walsh-Hadamard transforms, SSE instructions and popcounts, but found no faster way for N > 2**20, which is my case.)
Thanks,
gd
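The 1D circular convolution c of two arrays a and b of size n is, with the index convention used by the code below, an array such that:
$$c_i = \sum_{k=0}^{n-1} a_{(k-i) \bmod n} \, b_k$$
This formula can be rewritten in an iterative way:
$$c_i = c_{i-1} + \sum_{k=0}^{n-1} \left(b_k - b_{(k-1) \bmod n}\right) a_{(k-i) \bmod n}$$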
The non-null terms of the sum are limited to the number of changes nb of b : if b is a simple pattern, this sum can be limited to a few terms. An algorithm may now be designed to compute c :
1 : compute c[0] (about n operations)
2 : for 0<i<n compute c[i] using the formula (about nb*n operations)
If nb is small, this method may be faster than fft. Note that it will provide exact results for bitstream signals, while the fft needs oversampling and floating point precision to deliver accurate results.
Here is a piece of code implementing this trick with input type unsigned char.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <time.h>
#include <fftw3.h>
typedef struct{
unsigned int nbchange;
unsigned int index[1000];
int change[1000];
}pattern;
void topattern(unsigned int n, unsigned char* b,pattern* bp){
//initialisation
bp->nbchange=0;
unsigned int i;
unsigned char former=b[n-1];
for(i=0;i<n;i++){
if(b[i]!=former){
bp->index[bp->nbchange]=i;
bp->change[bp->nbchange]=((int)b[i])-former;
bp->nbchange++;
}
former=b[i];
}
}
void printpattern(pattern* bp){
int i;
printf("pattern :\n");
for(i=0;i<bp->nbchange;i++){
printf("index %d change %d\n",bp->index[i],bp->change[i]);
}
}
//https://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer
unsigned int NumberOfSetBits(unsigned int i)
{
i = i - ((i >> 1) & 0x55555555);
i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}
//https://stackoverflow.com/questions/2525310/how-to-define-and-work-with-an-array-of-bits-in-c
unsigned int convol_longint(unsigned int a, unsigned int b){
return NumberOfSetBits(a&b);
}
int main(int argc, char* argv[]) {
unsigned int n=10000000;
//the array a
unsigned char* a=malloc(n*sizeof(unsigned char));
if(a==NULL){printf("malloc failed\n");exit(1);}
unsigned int i,j;
for(i=0;i<n;i++){
a[i]=rand();
}
memset(&a[2],5,2);
memset(&a[10002],255,20);
for(i=0;i<n;i++){
//printf("a %d %d \n",i,a[i]);
}
//pattern b
unsigned char* b=malloc(n*sizeof(unsigned char));
if(b==NULL){printf("malloc failed\n");exit(1);}
memset(b,0,n*sizeof(unsigned char));
memset(&b[2],1,20);
//memset(&b[120],1,10);
//memset(&b[200],1,10);
int* c=malloc(n*sizeof(int)); //nb bit in the array
memset(c,0,n*sizeof(int));
clock_t begin, end;
double time_spent;
begin = clock();
/* here, do your time-consuming job */
//computing c[0]
for(i=0;i<n;i++){
//c[0]+= convol_longint(a[i],b[i]);
c[0]+= ((int)a[i])*((int)b[i]);
//printf("c[0] %d %d\n",c[0],i);
}
printf("c[0] %d\n",c[0]);
//need to store b as a pattern.
pattern bpat;
topattern( n,b,&bpat);
printpattern(&bpat);
//computing c[i] according to formula
for(i=1;i<n;i++){
c[i]=c[i-1];
for(j=0;j<bpat.nbchange;j++){
c[i]+=bpat.change[j]*((int)a[(bpat.index[j]-i+n)%n]);
}
}
//finding max
int currmax=c[0];
unsigned int currindex=0;
for(i=1;i<n;i++){
if(c[i]>currmax){
currmax=c[i];
currindex=i;
}
//printf("c[i] %d %d\n",i,c[i]);
}
printf("c[max] is %d at index %d\n",currmax,currindex);
end = clock();
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("computation took %lf seconds\n",time_spent);
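//timing one real-to-complex fft of size n for comparison
double* dp = calloc(n, sizeof (double)); //zero-filled input, only the timing matters
fftw_complex * cp = fftw_malloc(sizeof (fftw_complex) * (n/2+1));
if(dp==NULL || cp==NULL){printf("malloc failed\n");exit(1);}
begin = clock();
fftw_plan plan = fftw_plan_dft_r2c_1d(n, dp, cp, FFTW_ESTIMATE);
fftw_execute ( plan ); //run the transform inside the timed region
end = clock();
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("fftw took %lf seconds\n",time_spent);
fftw_destroy_plan(plan);
free(dp);
fftw_free(cp); //memory from fftw_malloc is released with fftw_free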
free(a);
free(b);
free(c);
return 0;
}
To compile : gcc main.c -o main -lfftw3 -lm
For n=10 000 000 and nb=2 (b is just a "rectangular 1D window") this algorithm runs in 0.65 seconds on my computer. A double-precision fft using fftw took approximately the same time. This comparison, like most comparisons, may be unfair since:
nb=2 is the best case for the algorithm presented in this answer.
The fft-based algorithm would have needed oversampling.
Double precision may not be required for the fft-based algorithm.
The implementation shown here is not optimized. It is just basic code.
This implementation can handle n=100 000 000. At this point, using long int for c would be advisable to avoid any risk of overflow.
If the signals are bitstreams, this program may be optimized in various ways. For bitwise operations, see this question and this one.
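One such optimization, sketched here under the assumption that a and b are packed 32 bits per word: the first step (computing c[0]) becomes a popcount of a AND b, one word at a time. This is the idea behind the unused convol_longint helper in the code above. The names popcount32 and packed_dot are invented for this example, and the loop-based popcount is chosen for portability, not speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable population count of one 32-bit word. */
static unsigned popcount32(uint32_t w) {
    unsigned count = 0;
    while (w != 0) {
        w &= w - 1;     /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* c[0] for bit-packed inputs: number of positions where a and b are both 1. */
unsigned long packed_dot(const uint32_t *a, const uint32_t *b, size_t words) {
    assert(a != NULL && b != NULL);
    unsigned long total = 0;
    for (size_t i = 0; i < words; i++) {
        total += popcount32(a[i] & b[i]);
    }
    return total;
}
```

For a = {0xF0F0F0F0, 0x1} and b = {0xFF000000, 0x3}, packed_dot returns 5: four common bits in the first word and one in the second. The other values c[i] would still need a circular shift of one of the packed arrays or the change-based update described above.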

Differentiate between negative and positive numbers?

In binary we can have signed and unsigned numbers. Say we are given the value 0101: how can we tell whether it is equal to 5 or to -1? As you may notice, the second bit from the left is on.
There is no difference in binary. The difference is in how a given language / compiler / environment / processor treats a given sequence of binary digits. For example, in the Intel x86/x64 world you have the MUL and IMUL instructions for multiplication. The IMUL instruction performs signed multiplication (i.e. treats the operand bits as a signed value). There are also other instructions that distinguish between signed/unsigned operands (e.g. DIV/IDIV, MOVSX, etc.).
Here's a quick example:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
int main(void)
{
int16_t c16;
uint16_t u16;
__asm {
mov al, 0x01
mov bl, 0x8F
mul bl // ax = 0x01 * 0x8F
mov u16, ax
mov al, 0x01
mov bl, 0x8F
imul bl // ax = 0x01 * 0x8F, with 0x8F read as signed (-113): ax = 0xFF8F
mov c16, ax
};
char uBits[65];
char cBits[65];
printf("%u:\t%s\n", u16, _itoa(u16, uBits, 2));
printf("%d:\t%s\n", c16, _itoa(c16, cBits, 2));
return 0;
}
Output is:
143: 10001111
-113: 11111111111111111111111110001111
On edit:
Just to expand on the example: in C/C++ (as in other languages that distinguish between signed and unsigned quantities), the compiler knows whether it is operating on signed or unsigned values and generates the appropriate instructions. In the above example, the compiler also knows it must sign-extend the variable c16 when calling _itoa(), because the argument is converted to int (in C/C++, int is signed by default; it is equivalent to saying signed int). The variable u16 is also converted to int in the call to _itoa(), but because its type is unsigned it is zero-extended rather than sign-extended (there is no sign bit in an unsigned value), so its value stays 143.
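For the 4-bit pattern in the question, the two readings can be written out in portable C. This is a sketch assuming two's complement for the signed reading; the function names are made up for illustration.

```c
#include <assert.h>

/* Read a 4-bit pattern as an unsigned value: 0..15. */
int nibble_as_unsigned(unsigned bits) {
    assert(bits <= 0xFu);   /* only 4 bits are meaningful here */
    return (int)bits;
}

/* Read the same 4-bit pattern as two's complement: -8..7. */
int nibble_as_signed(unsigned bits) {
    assert(bits <= 0xFu);
    /* The top bit (value 8) carries weight -8 instead of +8. */
    return (bits & 0x8u) ? (int)bits - 16 : (int)bits;
}
```

The pattern 0101 is 5 under both readings, because its top bit is 0. The readings differ only when the top bit is set: 1111 is 15 as unsigned and -1 as signed.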
On actual hardware, the representation of negative numbers depends on what the designers chose. Usually signed numbers are represented in two's complement, but there are other representations, such as ones' complement and sign-magnitude.