How to losslessly convert a double to string and back in Octave

When saving a double to a string there is some loss of precision. Even if you use a very large number of digits the conversion may not be reversible, i.e. if you convert a double x to a string sx and then convert back, you get a number x' which may not be bitwise equal to x. This can cause problems, for instance when checking for differences in a battery of tests. One possibility is to use a binary format (for instance Octave's native binary format, or HDF5), but I want to store the number in a text file, so I need a conversion to a string. I have a working solution, but I ask if there is some standard for this, or a better solution.
In C/C++ you could reinterpret the bytes of the double through a char* pointer and then print each byte as a two-character hex value with printf("%02x", c[j]). Then, for instance, pi would be converted to a string of length 16: 54442d18400921fb. The problem with this is that reading the hex gives you no idea of which number it is. So I would be interested in some mix, for instance pi -> 3.14{54442d18400921fb}: the first part is a (probably low-precision) decimal representation of the number (typically I would use a "%g" output conversion) and the string in braces is the lossless hexadecimal representation.
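For illustration, here is a minimal Python sketch of that mixed format (my own illustration, not part of the original post; note that hexing the raw little-endian bytes gives a different grouping than the %08x%08x word layout used in the Octave answer below):

import math
import struct

def dbl2str_sketch(x):
    # pack the double into its 8 raw IEEE-754 bytes (little-endian)
    raw = struct.pack("<d", x)
    # short decimal preview, plus the exact hex payload in braces
    return "%.3g{%s}" % (x, raw.hex())

def str2dbl_sketch(s):
    # recover the exact double from the {hex} part alone
    hexpart = s[s.index("{") + 1 : s.index("}")]
    return struct.unpack("<d", bytes.fromhex(hexpart))[0]

assert str2dbl_sketch(dbl2str_sketch(math.pi)) == math.pi  # bitwise round trip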
EDIT: I posted the code as an answer.

Following the ideas already suggested in the post, I wrote the following functions, which seem to work.
function s = dbl2str(d);
  z = typecast(d,"uint32");
  s = sprintf("%.3g{%08x%08x}\n",d,z);
endfunction
function d = str2dbl(s);
  k1 = index(s,"{");
  k2 = index(s,"}");
  ## Check that there is a balanced {} pair or none at all
  assert((k1==0) == (k2==0));
  if k1>0; assert(k2>k1); endif
  if (k1==0);
    ## If there is no {hexa} part, convert with loss
    d = str2double(s);
  else
    ## Convert losslessly from the hex digits
    ss = substr(s,k1+1,k2-k1-1);
    z = uint32(sscanf(ss,"%8x",2));
    d = typecast(z,"double");
  endif
endfunction
Then I have
>> spi=dbl2str(pi)
spi = 3.14{54442d18400921fb}
>> pi2 = str2dbl(spi)
pi2 = 3.1416
>> pi2-pi
ans = 0
>> snan = dbl2str(NaN)
snan = NaN{000000007ff80000}
>> nan1 = str2dbl(snan)
nan1 = NaN
A further improvement would be to use another type of encoding, for instance Base64 (as suggested by @CrisLuengo in a comment), which would reduce the length of the binary part from 16 to 11 characters.
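A hedged Python sketch of that Base64 variant (the padding handling is my assumption; 8 raw bytes encode to 12 Base64 characters, or 11 once the single '=' pad is dropped):

import base64
import struct

def dbl2str64(x):
    raw = struct.pack("<d", x)                          # 8 raw IEEE-754 bytes
    b64 = base64.b64encode(raw).rstrip(b"=").decode()   # 11 chars without padding
    return "%.3g{%s}" % (x, b64)

def str2dbl64(s):
    b64 = s[s.index("{") + 1 : s.index("}")]
    raw = base64.b64decode(b64 + "=" * (-len(b64) % 4)) # restore the padding
    return struct.unpack("<d", raw)[0]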

Related

How can I convert a bitstring to the binary form in Julia

I am using bitstring to perform an xor operation on the i-th bit of a string:
string = bitstring(string ⊻ 1 << i)
However the result will be a string, so I cannot continue with other values of i.
So I want to know: how do I convert a bitstring (of the form "000000000000000000000001001") back to a number (0b1001)?
Thanks
You can use parse to create an integer from the string, and then use string (alt. bitstring) to go the other way. Examples:
julia> str = "000000000000000000000001001";
julia> x = parse(UInt, str; base=2) # parse as UInt from input in base 2
0x0000000000000009
julia> x == 0b1001
true
julia> string(x; base=2) # stringify in base 2
"1001"
julia> bitstring(x) # stringify as bits (64 bits since UInt64 is 64 bits)
"0000000000000000000000000000000000000000000000000000000000001001"
Don't use bitstring. You can either do the math with a BitVector or just a UInt. There is no reason to bring a String into it.
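The comment's point, illustrated in Python for brevity (the principle is the same in Julia): keep the value as an integer and only format it when displaying.

x = 0b1001
i = 2
x ^= 1 << i       # flip bit i arithmetically; no string round trip needed
print(bin(x))     # -> 0b1101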

Cythonize: check if word in list of strings is a substring of another string

I want to iterate over a list of input words list_words and check if any of them occurs in an input string.
I tried to cythonize the code, but when I annotate it I see almost all of it in yellow, suggesting Python interactions.
I am not sure how I could speed this up:
cpdef cy_check_any_word_is_substring(list_words, string):
    cdef unicode w
    cdef unicode s_lowered = string.lower()
    for w in list_words:
        if w in s_lowered:
            return True
    return False
Example
# all words in list_words are lower cased
list_words = ['cat', 'dog', 'eat', 'seat']
input_string = 'The animal saw the Dog and started to make noises'
# should return true
cy_check_any_word_is_substring(list_words, input_string)
Note that I want the code to work regardless of whether characters are capitalized (that is why I do string.lower()); I assume the input list of words is already lowercased.
Update
I wonder if a solution that uses C++ could be faster. I don't know C++, though. I tried:
from libcpp.vector cimport vector
from libcpp.string cimport string

cpdef cy_check_any_word_is_substring(vector[string] list_words, string string):
    s_lowered = string.lower()
    for w in list_words:
        if w in s_lowered:
            return True
    return False
But it produces the error
Invalid types for 'in' (string, Python object)
Update 2
I tried the following to avoid the error from the previous update.
from libcpp.vector cimport vector
from libcpp.string cimport string, npos

cdef bint cy_check_w_substring(string s_lowered, vector[string] list_words):
    cdef string w
    for w in list_words:
        if s_lowered.find(w) != npos:
            return True
    return False

cpdef cy3_check_any_word_is_substring(words_bytes, input_string):
    cdef bint result = False
    s_lowered = input_string.lower()
    result = cy_check_w_substring(bytes(s_lowered, 'utf8'), words_bytes)
    return result
This can be used by changing the original list of words into a list of bytes.
# all words in list_words are lower cased
list_words = ['cat', 'dog', 'eat', 'seat']
list_words_bytes = [bytes(w,'utf8') for w in list_words]
input_string = 'The animal saw the Dog and started to make noises'
# should return true
cy3_check_any_word_is_substring(list_words_bytes, input_string)
Nevertheless, this is much slower:
%%timeit
cy3_check_any_word_is_substring(list_words_bytes, input_string)
#1.01 µs ± 3.16 ns per loop
%%timeit
cy_check_any_word_is_substring(list_words, input_string)
#190 ns ± 0.773 ns per loop
Note that cy3_check_any_word_is_substring internally casts s_lowered to bytes, and that cast alone takes 145 ns, almost the full cost of cy_check_any_word_is_substring, which makes this approach clearly not viable.
%%timeit
bytes(input_string, 'utf8')
#145 ns ± 0.55 ns per loop
The basic problem with the C++ solution is that if you pass it a Python iterable there's a hidden type conversion. So it has to iterate through the entire list and then convert every string to a C++ string. For this reason I doubt it'll give you much benefit.
If you can generate the data as a C++ vector without the type conversion then it may work better. For this you should use a cdef function instead of a cpdef function (I rarely like cpdef functions because they're usually the worst of both worlds).
The specific problems you have:
The C++ string class does not have a .lower() function, so the line s_lowered = string.lower() is implicitly converting it back to a Python bytes then calling .lower() on that. You'll have to implement .lower yourself (or convert to the C++ string after calling .lower on the Python object).
w in s_lowered isn't implemented for C++ strings. You want s_lowered.find(w) != npos (where npos is cimported from libcpp.string).

Shouldn't TclInvalidateStringRep() reset length?

I have a doubt about the following code in the Tcl 8.6.8 source, tclInt.h:
4277 #define TclInvalidateStringRep(objPtr) \
4278 if (objPtr->bytes != NULL) { \
4279 if (objPtr->bytes != tclEmptyStringRep) { \
4280 ckfree((char *) objPtr->bytes); \
4281 } \
4282 objPtr->bytes = NULL; \
4283 }
This macro is called by Tcl_InvalidateStringRep() in tclObj.c.
My doubt is, why doesn't tclObj's length get reset to zero?
Here is part of the definition of Tcl_Obj:
808 typedef struct Tcl_Obj {
809 int refCount; /* When 0 the object will be freed. */
810 char *bytes; /* This points to the first byte of the
811 * object's string representation. The array
812 * must be followed by a null byte (i.e., at
813 * offset length) but may also contain
814 * embedded null characters. The array's
815 * storage is allocated by ckalloc. NULL means
816 * the string rep is invalid and must be
817 * regenerated from the internal rep. Clients
818 * should use Tcl_GetStringFromObj or
819 * Tcl_GetString to get a pointer to the byte
820 * array as a readonly value. */
821 int length; /* The number of bytes at *bytes, not
822 * including the terminating null. */
So you can see that length is tightly coupled with bytes; when bytes is cleared, shouldn't we reset length as well?
My doubt comes from the following code, TclCreateLiteral() in tclLiteral.c:
200 for (globalPtr=globalTablePtr->buckets[globalHash] ; globalPtr!=NULL;
201 globalPtr = globalPtr->nextPtr) {
202 objPtr = globalPtr->objPtr;
203 if ((globalPtr->nsPtr == nsPtr)
204 && (objPtr->length == length) && ((length == 0)
205 || ((objPtr->bytes[0] == bytes[0])
206 && (memcmp(objPtr->bytes, bytes, (unsigned) length) == 0)))) {
So at line 204, when length is not zero while bytes is NULL, the program crashes.
My product includes the Tcl source, and I found the above problem while tracing a program crash. I put a workaround in our code, but I would like to confirm with the community whether it indeed is a vulnerability.
Your approach seems to be wrong somewhere.
Calling TclInvalidateStringRep is basically only allowed for objects with no references (refCount == 0) or with exactly one reference (refCount <= 1), and then only if you are sure that this one reference is your own.
Tcl's shared objects may switch their internal representation, but the string representation must remain immutable. Otherwise you break basic principles of Tcl (like EIAS, etc.).
Simplest example that can explain this:
set k 0x7f
dict set d $k test
expr {$k}; # ==> 127 (obj is integer now, but...)
puts $k; # ==> 0x7f (... still remains the string-representation)
puts [dict get $d $k]; # ==> test
# some code that fouls it up (despite of two references var `k` and key in dict `d`):
magic_happens_here $k; # string representation gets lost.
# and hereafter:
puts $k; # ==> 127 (representation is now 127, so...)
puts [dict get $d $k]; # ==> ERROR: key "127" not known in dictionary
As you can see, resetting or altering the string representation of a shared object is wrong by design.
Please avoid this in Tcl.
I've had a think about this, and while I believe that the code that is purging the representation is wrong to do so (since the object should in principle be shared and so shouldn't be observed to change) I certainly think that it is extremely difficult to actually prove that that can't happen. For sure, TclCreateLiteral in tclLiteral.c shouldn't blow up if it happens!
The fix I'm using is to make TclCreateLiteral use TclGetStringFromObj (the Tcl-internal macro-ized version of Tcl_GetStringFromObj) to get the bytes and length fields instead of using them directly, so that the correct constraints are preserved. This should make the string representation exist once more if it is removed. If the code continues to crash, the problem is your code that is calling TclInvalidateStringRep on a literal (and setting a type that can't have a string generated for it; Tcl has some of those, but that's because it never purges the original string from them).
Remember, a Tcl_Obj should only have its string rep purged when it becomes wrong, not just when it gains a non-string representation. The fact a value has been interpreted as an integer doesn't mean that it shouldn't be interpretable as a list (quite the reverse!) and if the internal representation is never updated to a different value (in-place modifications should only ever happen to unshared objects) it should never need to lose that string representation at all.

Double-precision error using Dislin

I get the following error when trying to compile:
call qplot (Z, B, m + 1)
1
Error: Type mismatch in argument 'x' at (1); passed REAL(8) to REAL(4)
Everything seems to be in double precision so I can't help but think it is a Dislin error, especially considering that it appears with reference to a Dislin statement. What am I doing wrong? My code is the following:
program test
  use dislin
  integer :: i
  integer, parameter :: n = 2
  integer, parameter :: m = 5000
  real (kind = 8) :: X(n + 1), Z(0:m), B(0:m)

  X(1) = 1.D0
  X(2) = 0.D0
  X(3) = 2.D0

  do i = 0, m
    Z(i) = -1.D0 + (2.D0*i) / m
    B(i) = f(Z(i))
  end do

  call qplot (Z, B, m + 1)
  read(*,*)

contains

  real (kind = 8) function f(t)
    implicit none
    real (kind = 8), intent(in) :: t
    real (kind = 8), parameter :: pi = Atan(1.D0)*4.D0
    f = cos(pi*t)
  end function f

end program
From the DISLIN manual I read that qplot requires (single-precision) floats:
QPLOT connects data points with lines.
The call is: CALL QPLOT (XRAY, YRAY, N) level 0, 1
or: void qplot (const float *xray, const float *yray, int n);
XRAY, YRAY are arrays that contain X- and Y-coordinates.
N is the number of data points.
So you need to convert Z and B to real:
call qplot (real(Z), real(B), m + 1)
Instead of using fixed numbers for the kind parameters (which vary between compilers), please consider using the ISO_Fortran_env module and its predefined constants REAL32 and REAL64.
The qplot routine requires a default real. You can convert your data
call qplot(real(Z), real(B), m + 1)
I second the remark about kind = 8: it is very ugly. If you insist on 8, at least declare a constant
integer, parameter :: rp = 8
and use
real(rp) ::
As the first two answers explain, the standard versions of the dislin routines require single precision arguments. I find it most convenient to use these since I may have single or double arguments, using the real technique to convert the type of double variables. It seems unlikely that the lost precision will be perceptible on a graph. However, if you wish to work exclusively in double precision, there is an alternative set of routines. They have the same names, but take double precision arguments. To obtain them, link in the library "dislin_d".

Converting to Base 10

Question
Let's say I have a string or array which represents a number in base N, N>1, where N is a power of 2. Assume the number being represented is larger than the system can handle as an actual number (an int or a double etc).
How can I convert that to a decimal string?
I'm open to a solution for any base N which satisfies the above criteria (binary, hex, ...). That is if you have a solution which works for at least one base N, I'm interested :)
Example:
Input: "10101010110101"
Output: "10933"
It depends on the particular language. Some have native support for arbitrary-length integers, and others can use libraries such as GMP. After that it's just a matter of doing the lookup in a table for the digit value, then multiplying as appropriate.
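In Python, for example, the whole task collapses to a single call, because its integers are arbitrary precision:

s = "10101010110101"
print(str(int(s, 2)))   # -> "10933", matching the example above; int() accepts bases 2-36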
This is from a Python-based computer science course I took last semester that's designed to handle up to base-16.
import string

def baseNTodecimal():
    # get the number as a string
    number = raw_input("Please type a number: ")
    # convert it to all uppercase to match hexDigits (below)
    number = string.upper(number)
    # get the base as an integer
    base = input("Please give me the base: ")
    # the number of values that we have to change to base10
    digits = len(number)
    base10 = 0
    # first position of any baseN number is 1's
    position = 1
    # set up a string so that the position of
    # each character matches the decimal
    # value of that character
    hexDigits = "0123456789ABCDEF"
    # for each 'digit' in the string
    for i in range(1, digits+1):
        # find where it occurs in the string hexDigits
        digit = string.find(hexDigits, number[-i])
        # multiply the value by the base position
        # and add it to the base10 total
        base10 = base10 + (position * digit)
        print number[-i], "is in the " + str(position) + "'s position"
        # increase the position by the base (e.g., 8's position * 2 = 16's position)
        position = position * base
    print "And in base10 it is", base10
Basically, it takes input as a string, then walks through it adding up each "digit" multiplied by its positional value. Each digit's numerical value is its index position in the string hexDigits.
Assuming the number that it returns is actually larger than the programming language supports, you could build up an array of Ints that represent the entire number:
[214748364, 8]
would represent 2147483648 (a number that a Java int couldn't handle).
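If the language has no bignum support, that array idea works with schoolbook multiply-accumulate. Here is a rough Python sketch of the technique (Python's own ints are unbounded, so the limb array is purely illustrative):

def base_n_to_decimal(number, base):
    LIMB = 10**9                        # each limb holds 9 decimal digits
    limbs = [0]                         # least-significant limb first
    for ch in number:
        carry = int(ch, base)           # numeric value of this character
        for i in range(len(limbs)):     # limbs = limbs*base + digit, with carry
            carry += limbs[i] * base
            limbs[i] = carry % LIMB
            carry //= LIMB
        while carry:                    # grow the array as the number grows
            limbs.append(carry % LIMB)
            carry //= LIMB
    # stitch the limbs together, zero-padding all but the most significant
    return str(limbs[-1]) + "".join("%09d" % l for l in reversed(limbs[:-1]))

assert base_n_to_decimal("10101010110101", 2) == "10933"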
Here is some PHP code I've just written:
function to_base10($input, $base)
{
    $result = 0;
    $length = strlen($input);
    for ($x = $length - 1; $x >= 0; $x--)
        $result += (int)$input[$x] * pow($base, ($length - 1) - $x);
    return $result;
}
It's dead simple: just a loop through every char of the input string.
This works with any base < 10, but it can easily be extended to support higher bases (A -> 10, B -> 11, etc.).
edit: oh, I didn't see the Python code :)
yeah, that's cooler
I would choose a language which more or less natively supports math representation, like Lisp. I know fewer and fewer people seem to use it, but it still has its value.
I don't know if this is large enough for your usage, but the largest integer I could represent in my Common Lisp environment (CLISP) was 2^(2^20):
>> (expt 2 (expt 2 20))
In Lisp you can easily represent hex, dec, oct and bin as follows:
>> #b1010
10
>> #o12
10
>> 10
10
>> #x0A
10
You can write rationals in other bases from 2 to 36 with #nR
>> #36rABCDEFGHIJKLMNOPQRSTUVWXYZ
8337503854730415241050377135811259267835
For more information on numbers in Lisp, see the book Practical Common Lisp.