How can I convert a bitstring to the binary form in Julia

I am using bitstring to perform an xor operation on the ith bit of a string:
string = bitstring(string ⊻ 1 << i)
However, the result is a string, so I cannot repeat the operation for another i.
So I want to know: how do I convert a bitstring (of the form "000000000000000000000001001") back to a number (0b1001)?
Thanks

You can use parse to create an integer from the string, and then use string (alternatively bitstring) to go the other way. Examples:
julia> str = "000000000000000000000001001";
julia> x = parse(UInt, str; base=2) # parse as UInt from input in base 2
0x0000000000000009
julia> x == 0b1001
true
julia> string(x; base=2) # stringify in base 2
"1001"
julia> bitstring(x) # stringify as bits (64 bits since UInt64 is 64 bits)
"0000000000000000000000000000000000000000000000000000000000001001"

Don't use bitstring. You can do the math with either a BitVector or just a UInt; there is no reason to bring a String into it.
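For example, flipping bit i directly on an integer (a minimal sketch; the UInt8 width and the 0-based bit index are just illustrative choices):
julia> x = UInt8(0b1001);
julia> x = x ⊻ (UInt8(1) << 2) # flip bit 2, keeping the value an integer
0x0d
julia> bitstring(x) # only stringify at the very end, for display
"00001101"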


Cythonize: check if word in list of strings is a substring of another string

I want to iterate over a list of input words list_words and check whether any of them occurs in an input string.
I tried to cythonize the code, but when I annotate it I see almost all of it in yellow, suggesting Python interactions.
I'm not sure how I could speed this up:
cpdef cy_check_any_word_is_substring(list_words, string):
    cdef unicode w
    cdef unicode s_lowered = string.lower()
    for w in list_words:
        if w in s_lowered:
            return True
    return False
Example
# all words in list_words are lower cased
list_words = ['cat', 'dog', 'eat', 'seat']
input_string = 'The animal saw the Dog and started to make noises'
# should return true
cy_check_any_word_is_substring(list_words, input_string)
Note: I want the code to work regardless of whether characters are capitalized (that is why I call string.lower()); I assume the input list of words is already lowercased.
Update
I wonder if a solution that uses C++ could be faster.
I don't know C++, though. I tried:
from libcpp.vector cimport vector
from libcpp.string cimport string

cpdef cy_check_any_word_is_substring(vector[string] list_words, string string):
    s_lowered = string.lower()
    for w in list_words:
        if w in s_lowered:
            return True
    return False
But it produces the error
Invalid types for 'in' (string, Python object)
Update 2
I tried the following to avoid the error shown in the previous update.
from libcpp.vector cimport vector
from libcpp.string cimport string, npos

cdef bint cy_check_w_substring(string s_lowered, vector[string] list_words):
    cdef string w
    for w in list_words:
        if s_lowered.find(w) != npos:
            return True
    return False

cpdef cy3_check_any_word_is_substring(words_bytes, input_string):
    cdef bint result = False
    s_lowered = input_string.lower()
    result = cy_check_w_substring(bytes(s_lowered, 'utf8'), words_bytes)
    return result
This can be used after converting the original list of words to a list of bytes.
# all words in list_words are lower cased
list_words = ['cat', 'dog', 'eat', 'seat']
list_words_bytes = [bytes(w,'utf8') for w in list_words]
input_string = 'The animal saw the Dog and started to make noises'
# should return true
cy3_check_any_word_is_substring(list_words_bytes, input_string)
Nevertheless, this is much slower:
%%timeit
cy3_check_any_word_is_substring(list_words_bytes, input_string)
#1.01 µs ± 3.16 ns per loop
%%timeit
cy_check_any_word_is_substring(list_words, input_string)
#190 ns ± 0.773 ns per loop
Note that cy3_check_any_word_is_substring internally casts s_lowered to bytes, but that cast alone already takes 145 ns, which is almost the full cost of cy_check_any_word_is_substring. This makes the approach clearly not viable.
%%timeit
bytes(input_string, 'utf8')
#145 ns ± 0.55 ns per loop
The basic problem with the C++ solution is that if you pass it a Python iterable there's a hidden type conversion. So it has to iterate through the entire list and then convert every string to a C++ string. For this reason I doubt it'll give you much benefit.
If you can generate the data as a C++ vector without the type conversion then it may work better. For this you should use a cdef function instead of a cpdef function (I rarely like cpdef functions because they're usually the worst of both worlds).
The specific problems you have:
The C++ string class does not have a .lower() function, so the line s_lowered = string.lower() is implicitly converting it back to a Python bytes then calling .lower() on that. You'll have to implement .lower yourself (or convert to the C++ string after calling .lower on the Python object).
w in s_lowered isn't implemented for C++ strings. You want s_lowered.find(w) != npos (where npos is cimported from libcpp.string).
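If the word list can be installed once up front, the per-call Python-to-C++ conversion disappears. A minimal sketch of that (set_words and check are hypothetical names, and it assumes matching on UTF-8 bytes is acceptable):
from libcpp.vector cimport vector
from libcpp.string cimport string, npos

cdef vector[string] WORDS  # module-level C++ vector, filled once and reused

def set_words(list_words):
    # one-time Python -> C++ conversion of the (already lowercased) words
    global WORDS
    WORDS = [w.encode('utf8') for w in list_words]

def check(input_string):
    # lowercase on the Python side, then convert the haystack once per call
    cdef string haystack = input_string.lower().encode('utf8')
    cdef string w
    for w in WORDS:
        if haystack.find(w) != npos:
            return True
    return False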

Comma-separated binary arguments? - elixir

I've been learning elixir this month, and was in a situation where I wanted to convert a binary object into a list of bits, for pattern matching.
My research led me here, to an article showing a method for doing so. However, I don't fully understand one of the arguments passed to the extract function.
I could just copy and paste the code, but I'd like to understand what's going on under the hood here.
The argument is this: <<b :: size(1), bits :: bitstring>>.
What I understand
I understand that << x >> denotes a binary object x. Logically to me, it looks as though this is similar to performing: [head | tail] = list on a List, to get the first element, and then the remaining ones as a new list called tail.
What I don't understand
However, I'm not familiar with the syntax, and I have never seen :: in Elixir, nor have I ever seen a binary object separated by a comma: ,. I also haven't seen size(x) used in Elixir, and have never encountered a bitstring.
The Bottom Line
If someone, could explain exactly how the syntax for this argument breaks down, or point me towards a resource I would highly appreciate it.
For your convenience, the code from that article:
defmodule Bits do
  # this is the public api which allows you to pass any binary representation
  def extract(str) when is_binary(str) do
    extract(str, [])
  end

  # this function does the heavy lifting by matching the input binary to
  # a single bit and sending the rest of the bits recursively back to itself
  defp extract(<<b :: size(1), bits :: bitstring>>, acc) when is_bitstring(bits) do
    extract(bits, [b | acc])
  end

  # this is the terminal condition when we don't have anything more to extract
  defp extract(<<>>, acc), do: acc |> Enum.reverse
end

IO.inspect Bits.extract("!!")      # => [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1]
IO.inspect Bits.extract(<< 99 >>)  # => [0, 1, 1, 0, 0, 0, 1, 1]
Elixir pattern matching seems mind-blowingly easy to use for structured binary data.
Yep. You can thank the erlang inventors.
According to the documentation, <<x :: size(y)>> denotes a bitstring, whose decimal value is x and is represented by a string of bits that is y in length.
Let's dumb it down a bit: <<x :: size(y)>> is the integer x inserted into y bits. Examples:
<<1 :: size(1)>> => 1
<<1 :: size(2)>> => 01
<<1 :: size(3)>> => 001
<<2 :: size(3)>> => 010
<<2 :: size(4)>> => 0010
A binary is a bitstring whose number of bits is divisible by 8, i.e. one that holds a whole number of bytes (1 byte = 8 bits); a bitstring in general may have any number of bits. That's the difference between the binary type and the bitstring type.
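You can see the distinction in iex with the standard Kernel functions is_binary/1 and is_bitstring/1:
iex> is_binary(<<1, 2, 3>>)        # 24 bits, a whole number of bytes
true
iex> is_binary(<<1::size(5)>>)     # 5 bits, not a whole number of bytes
false
iex> is_bitstring(<<1::size(5)>>)  # ...but still a bitstring
true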
I understand that << x >> denotes a binary object x. Logically to me, it looks as though this is similar to performing: [head | tail] = list on a List, to get the first element, and then the remaining ones as a new list called tail.
Yes:
defmodule A do
  def show_list([]), do: :ok
  def show_list([head|tail]) do
    IO.puts head
    show_list(tail)
  end

  def show_binary(<<>>), do: :ok
  def show_binary(<<char::binary-size(1), rest::binary>>) do
    IO.puts char
    show_binary(rest)
  end
end
In iex:
iex(6)> A.show_list(["a", "b", "c"])
a
b
c
:ok
iex(7)> "abc" = <<"abc">> = <<"a", "b", "c">> = <<97, 98, 99>>
"abc"
iex(9)> A.show_binary(<<97, 98, 99>>)
a
b
c
:ok
Or you can interpret the integers in the binary as plain old integers:
def show(<<>>), do: :ok
def show(<<ascii_code::integer-size(8), rest::binary>>) do
  IO.puts ascii_code
  show(rest)
end
In iex:
iex(6)> A.show(<<97, 98, 99>>)
97
98
99
:ok
The utf8 type is super useful because it will grab as many bytes as required to get a whole utf8 character:
def show(<<>>), do: :ok
def show(<<char::utf8, rest::binary>>) do
  IO.puts char
  show(rest)
end
In iex:
iex(8)> A.show("€ë")
8364
235
:ok
As you can see, the utf8 type returns the unicode codepoint of the character. To get the character as a string/binary:
def show(<<>>), do: :ok
def show(<<codepoint::utf8, rest::binary>>) do
  IO.puts <<codepoint::utf8>>
  show(rest)
end
You take the codepoint (an integer) and use it to create the binary/string <<codepoint::utf8>>.
In iex:
iex(1)> A.show("€ë")
€
ë
:ok
You can't specify a size for the utf8 type, though, so if you want to read multiple utf8 characters, you have to specify multiple segments.
And of course, the segment rest::binary, i.e. a binary type with no size specified, is super useful. It can only appear at the end of a pattern, and rest::binary is like the greedy regex: (.*). The same goes for rest::bitstring.
Although the elixir docs don't mention it anywhere, the total number of bits in a segment, where a segment is one of those things:
   |           |            |
   v           v            v
<< 1::size(8), 1::size(16), 1::size(1) >>
is actually unit * size, where each type has a default unit. The default type for a segment is integer, so the type for each segment above defaults to integer. An integer has a default unit of 1 bit, so the total number of bits in the first segment is: 8 * 1 bit = 8 bits. The default unit for the binary type is 8 bits, so a segment like:
<< char::binary-size(6)>>
has a total size of 6 * 8 bits = 48 bits. Equivalently, size(6) is just the number of bytes. You can specify the unit just like you can the size, e.g. <<1::integer-size(2)-unit(3)>>. The total bit size of that segment is: 2 * 3 bits = 6 bits.
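You can verify those totals with bit_size/1:
iex> bit_size(<<1::size(8), 1::size(16), 1::size(1)>>)
25
iex> bit_size(<<"abcdef"::binary-size(6)>>)
48
iex> bit_size(<<1::integer-size(2)-unit(3)>>)
6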
However, I'm not familiar with the syntax
Check this out:
def bitstr2bits(bitstr) do
  for <<bit::integer-size(1) <- bitstr>>, do: bit
end
In iex:
iex(17)> A.bitstr2bits <<1::integer-size(2), 2::integer-size(2)>>
[0, 1, 1, 0]
Equivalently:
iex(3)> A.bitstr2bits(<<0b01::integer-size(2), 0b10::integer-size(2)>>)
[0, 1, 1, 0]
Elixir tends to abstract away recursion with library functions, so usually you don't have to come up with your own recursive definitions like at your link. However, that link shows one of the standard, basic recursion tricks: adding an accumulator to the function call to gather results that you want the function to return. That function could also be written like this:
def bitstr2bits(<<>>), do: []
def bitstr2bits(<<bit::integer-size(1), rest::bitstring>>) do
  [bit | bitstr2bits(rest)]
end
The accumulator function at the link is tail recursive, which means it takes up a constant (small) amount of memory, no matter how many recursive function calls are needed to step through the bitstring. A bitstring with 10 million bits, requiring 10 million recursive function calls, would still only need a small amount of memory. In the old days, the alternate definition I posted could potentially crash your program, because it would take up more and more memory with each recursive function call; if the bitstring were long enough, the amount of memory needed would be too large, you would get a stack overflow, and your program would crash. However, erlang has optimized away the disadvantages of recursive functions that are not tail recursive.
You can read about all of these here; the short answer:
:: is similar to a guard, like a when is_integer(a); in your case size(1) expects a 1-bit segment
, is a separator between matched segments, like | in [x | []] or the comma in [a, b]
bitstring is a superset of binary (you can read about them here and here); any binary can be represented as a bitstring:
iex> ?h
104
iex> ?e
101
iex> ?l
108
iex> ?o
111
iex> <<104, 101, 108, 108, 111>>
"hello"
iex> [104, 101, 108, 108, 111]
'hello'
but not vice versa
iex> <<1, 2, 3>>
<<1, 2, 3>>
After some research, I realized I overlooked some important information located at: elixir-lang.
According to the documentation, <<x :: size(y)>> denotes a bitstring, whose decimal value is x and is represented by a string of bits that is y in length.
Furthermore, <<binary>> will always attempt to conglomerate values in a left-first direction, into bytes or 8-bits; however, if the number of bits is not divisible by 8, there will be x bytes, followed by a bitstring.
For example:
iex> <<3::size(5), 5::size(6)>> # <<00011, 000101>>
<<24, 5::size(3)>> # automatically shifted to: <<00011000 (24), 101>>
Now, elixir also lets us pattern match binaries, and bitstrings like so:
iex> <<3 :: size(2), b :: bitstring>> = <<61 :: size(6)>> # [11] [1101]
iex> b
<<13 :: size(4)>> # [1101]
So, I completely misunderstood binaries and bitstrings, and pattern matching between the two.
Not really the answer to the question stated, but I'll put it here for the sake of formatting. In Elixir we usually use the Kernel.SpecialForms.for/1 comprehension for bitstring generation.
for << b :: size(1) <- "!!" >>, do: b
#⇒ [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1]
for << b :: size(1) <- <<99>> >>, do: b
#⇒ [0, 1, 1, 0, 0, 0, 1, 1]
I wanted to use the bits in an 8-bit binary to toggle conditions. So
[b1, b2, ...] = extract(<<binary>>)
I then wanted to say:
if b1, do: x....
if b2, do: y...
Is there a better way to do what I'm trying to do, instead of pattern matching?
First of all, the only terms that evaluate to false in elixir are false and nil (just like in ruby):
iex(18)> x = 1
1
iex(19)> y = 0
0
iex(20)> if x, do: IO.puts "I'm true."
I'm true.
:ok
iex(21)> if y, do: IO.puts "I'm true."
I'm true.
:ok
Although, the fix is easy:
if b1 == 1, do: ...
Extracting the bits into a list is unnecessary because you can just iterate the bitstring:
def check_bits(<<>>), do: :ok
def check_bits(<<bit::integer-size(1), rest::bitstring>>) do
  if bit == 1, do: IO.puts "bit is on"
  check_bits(rest)
end
In other words, you can treat a bitstring similarly to a list.
Or, instead of performing the logic in the body of the function to determine whether the bit is 1, you can use pattern matching in the head of the function:
def check_bits(<<>>), do: :ok
def check_bits(<< 1::integer-size(1), rest::bitstring >>) do
  IO.puts "The bit is 1."
  check_bits(rest)
end
def check_bits(<< 0::integer-size(1), rest::bitstring >>) do
  IO.puts "The bit is 0."
  check_bits(rest)
end
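In iex (assuming the clauses above live in the module A):
iex> A.check_bits(<<0b101::size(3)>>)
The bit is 1.
The bit is 0.
The bit is 1.
:ok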
Instead of using a variable, bit, for the match like here:
bit::integer-size(1)
...you use a literal value, 1:
1::integer-size(1)
The only thing that can match a literal value is the literal value itself. As a result, the pattern:
<< 1::integer-size(1), rest::bitstring >>
will only match a bitstring where the first bit, integer-size(1), is 1. The literal matching employed there is similar to doing the following with a list:
def list_contains_4([4|_tail]) do
  IO.puts "found a 4"
  true # end the recursion and return true
end
def list_contains_4([head|tail]) do
  IO.puts "#{head} is not a 4"
  list_contains_4(tail)
end
def list_contains_4([]), do: false
The first function clause tries to match the literal 4 at the head of the list. If the head of the list is not 4, there's no match; so elixir moves on to the next function clause, and in the next function clause the variable head will match anything in the list.
Using pattern matching in the head of a function rather than performing logic in the body of a function is considered more stylish and efficient in erlang.

How to losslessly convert a double to a string and back in Octave

When saving a double to a string there is some loss of precision. Even if you use a very large number of digits, the conversion may not be reversible: if you convert a double x to a string sx and then convert back, you get a number x' which may not be bitwise equal to x. This can cause problems, for instance when checking for differences in a battery of tests. One possibility is to use a binary form (for instance the native binary format, or HDF5), but I want to store the number in a text file, so I need a conversion to a string. I have a working solution, but I ask whether there is some standard for this, or a better solution.
In C/C++ you could cast the double's address to a byte pointer like char* and then convert each byte to two hex characters with printf("%02x", c[j]). Then for instance pi would be converted to a string of length 16: 54442d18400921fb. The problem with this is that reading the hex you don't get any idea of which number it is. So I would be interested in some mix, for instance pi -> 3.14{54442d18400921fb}. The first part is a (probably low-precision) decimal representation of the number (typically I would use a "%g" output conversion) and the string in braces is the lossless hexadecimal representation.
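For reference, a minimal sketch of that C idea (the byte order, and hence the exact hex string, depends on the machine's endianness):
#include <stdio.h>

int main(void) {
    double x = 3.141592653589793;
    const unsigned char *c = (const unsigned char *)&x;
    printf("%.3g{", x);
    for (size_t j = 0; j < sizeof x; j++)
        printf("%02x", c[j]);  /* two hex characters per byte */
    printf("}\n");             /* e.g. 3.14{182d4454fb210940} on little-endian x86 */
    return 0;
}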
EDIT: I pass the code as an answer.
Following the ideas already suggested in the post, I wrote the following functions, which seem to work.
function s = dbl2str(d);
  z = typecast(d,"uint32");
  s = sprintf("%.3g{%08x%08x}\n",d,z);
endfunction

function d = str2dbl(s);
  k1 = index(s,"{");
  k2 = index(s,"}");
  ## Check that there is a balanced {} or none at all
  assert((k1==0) == (k2==0));
  if k1>0; assert(k2>k1); endif
  if (k1==0);
    ## If there is no {hexa} part, convert with loss
    d = str2double(s);
  else
    ## Convert losslessly
    ss = substr(s,k1+1,k2-k1-1);
    z = uint32(sscanf(ss,"%8x",2));
    d = typecast(z,"double");
  endif
endfunction
Then I have
>> spi=dbl2str(pi)
spi = 3.14{54442d18400921fb}
>> pi2 = str2dbl(spi)
pi2 = 3.1416
>> pi2-pi
ans = 0
>> snan = dbl2str(NaN)
snan = NaN{000000007ff80000}
>> nan1 = str2dbl(snan)
nan1 = NaN
A further improvement would be to use another type of encoding, for instance Base64 (as suggested by @CrisLuengo in a comment), which would reduce the length of the binary part from 16 to 11 bytes.

How to create a TCL variable of type bytearray

I am using TCL 8.4.20.
So I have the following code:
set a [binary format H2 1]
set b [binary format H2 2]
set c [binary format H2 3]
set bytes $a
append bytes $a
append bytes $b
append bytes $c
puts $bytes
I set a breakpoint at Tcl_PutsObjCmd() function in TCL's C source code and I see its argument, $bytes, is of type string while I expect it to be bytearray.
Question 1: Why is that? From the first assignment to the final append, "bytes" accepts nothing but binary data.
The reason I do this experiment is that we have a TCL extension command in C which expects its argument to be of byte array type: it checks that the value's typePtr is tclByteArrayType. My TCL code currently fails on this command because the data passed to it is of type string, just as demonstrated above.
I googled around, and it seems the "right" way to make a byte array object is to have every byte ready first and then use one "binary format" command to put them all into one. But that would be a fairly big change to my current TCL code.
Question 2: Given that I already have a TCL variable whose data are all binary (created using "binary format" for each byte and put together using "append") while its type is string, how can I change its internal type to "bytearray" through some TCL maneuvering?
Technically, the internal type is not a guaranteed property. Everything is a string. The code may shimmer a type away whenever it feels like. And code that depends on the internal type is usually very brittle or outright broken.
So your C code should call Tcl_GetByteArrayFromObj() instead of peeking at the argument's internals. That does the proper conversion if the object does not yet have a byteArray representation.
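A minimal sketch of what the C side could look like (MyCmd is a hypothetical command name):
static int
MyCmd(ClientData cd, Tcl_Interp *interp, int objc, Tcl_Obj *const objv[])
{
    int len;
    unsigned char *bytes;

    if (objc != 2) {
        Tcl_WrongNumArgs(interp, 1, objv, "bytes");
        return TCL_ERROR;
    }
    /* Converts the value to a bytearray rep if it has none yet,
       instead of rejecting values based on their current typePtr. */
    bytes = Tcl_GetByteArrayFromObj(objv[1], &len);
    /* ... work with bytes[0 .. len-1] ... */
    return TCL_OK;
}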
About your questions:
Why doesn't append of two byte arrays keep the byte array type?
It does, at least for 8.6, if you do it right and never trigger the creation of a string rep.
Running this in tkcon, the append turns the value into a string:
() 98 % set a [binary format H2 1]

() 99 % set b [binary format H2 1]

() 100 % ::tcl::unsupported::representation $a
value is a bytearray with a refcount of 2, object pointer at 0000000005665420, internal representation 000000000587B280:0000000005665240, string representation ""
() 101 % ::tcl::unsupported::representation $b
value is a bytearray with a refcount of 2, object pointer at 000000000564EEB0, internal representation 000000000587B4A0:00000000056590E0, string representation ""
() 102 % set x $a

() 103 % ::tcl::unsupported::representation $x
value is a bytearray with a refcount of 4, object pointer at 0000000005665420, internal representation 000000000587B280:0000000005665240, string representation ""
() 104 % append x $b

() 105 % ::tcl::unsupported::representation $x
value is a string with a refcount of 3, object pointer at 0000000005663F50, internal representation 0000000005896BA0:000000000564F030, string representation ""
This happens because the bytearray had a string rep created (due to Tkcon echoing the value). The append optimization only works for 'pure' bytearrays, i.e. bytearrays that do not have a string rep. This is similar to some optimizations for 'pure' lists.
So it works like this, where the trailing puts calls suppress the result echo that would otherwise cause the shimmering:
() 106 % set b [binary format H2 1]; puts "pure"
pure
() 107 % set a [binary format H2 1]; puts "pure"
pure
() 108 % set x $a; puts "pure"
pure
() 109 % ::tcl::unsupported::representation $a
value is a bytearray with a refcount of 3, object pointer at 0000000005658780, internal representation 000000000587B320:0000000005658CF0, no string representation
() 110 % ::tcl::unsupported::representation $b
value is a bytearray with a refcount of 2, object pointer at 000000000564ED60, internal representation 000000000587B500:0000000005658750, no string representation
() 111 % ::tcl::unsupported::representation $x
value is a bytearray with a refcount of 3, object pointer at 0000000005658780, internal representation 000000000587B320:0000000005658CF0, no string representation
() 112 % append x $b; puts "pure"
pure
() 113 % ::tcl::unsupported::representation $x
value is a bytearray with a refcount of 2, object pointer at 0000000005658690, internal representation 00000000058A5C60:0000000005658960, no string representation
Note the no string representation part.
How to turn a string into a bytearray
Just do a binary format:
set x [binary format a* $x]
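For example, in 8.6 (using the same puts trick as above so the echo does not create a string rep):
() 114 % set x [binary format a* "hello"]; puts "pure"
pure
() 115 % ::tcl::unsupported::representation $x
value is a bytearray with a refcount of 2, object pointer at ..., internal representation ..., no string representation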

Converting to Base 10

Question
Let's say I have a string or array which represents a number in base N, N>1, where N is a power of 2. Assume the number being represented is larger than the system can handle as an actual number (an int or a double etc).
How can I convert that to a decimal string?
I'm open to a solution for any base N which satisfies the above criteria (binary, hex, ...). That is if you have a solution which works for at least one base N, I'm interested :)
Example:
Input: "10101010110101"
-
Output: "10933"
It depends on the particular language. Some have native support for arbitrary-length integers, and others can use libraries such as GMP. After that it's just a matter of doing the lookup in a table for the digit value, then multiplying as appropriate.
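For instance, Python's integers are arbitrary-precision, so with native support the whole task is a single call:
>>> int("10101010110101", 2)  # parse the base-2 digits
10933
>>> str(int("10101010110101", 2))
'10933'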
This is from a Python-based computer science course I took last semester that's designed to handle up to base-16.
import string

def baseNTodecimal():
    # get the number as a string
    number = raw_input("Please type a number: ")
    # convert it to all uppercase to match hexDigits (below)
    number = string.upper(number)
    # get the base as an integer
    base = input("Please give me the base: ")
    # the number of values that we have to change to base10
    digits = len(number)
    base10 = 0
    # first position of any baseN number is 1's
    position = 1
    # set up a string so that the position of
    # each character matches the decimal
    # value of that character
    hexDigits = "0123456789ABCDEF"
    # for each 'digit' in the string
    for i in range(1, digits+1):
        # find where it occurs in the string hexDigits
        digit = string.find(hexDigits, number[-i])
        # multiply the value by the base position
        # and add it to the base10 total
        base10 = base10 + (position * digit)
        print number[-i], "is in the " + str(position) + "'s position"
        # increase the position by the base (e.g., 8's position * 2 = 16's position)
        position = position * base
    print "And in base10 it is", base10
Basically, it takes input as a string and then goes through and adds up each "digit" multiplied by the base-10 position. Each digit is actually checked for its index-position in the string hexDigits which is used as the numerical value.
Assuming the number that it returns is actually larger than the programming language supports, you could build up an array of Ints that represent the entire number:
[214748364, 8]
would represent 2147483648 (a number that a Java int couldn't handle).
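A minimal sketch of that idea (Python for brevity, but deliberately using only single-digit arithmetic, the way you would in a language without big integers; to_decimal_string is a hypothetical name):
def to_decimal_string(digits, base):
    # digits: list of digit values in the source base, most significant first
    dec = [0]                      # decimal digits, least significant first
    for d in digits:
        carry = d                  # compute dec = dec*base + d ...
        for i in range(len(dec)):  # ... one decimal digit at a time
            v = dec[i] * base + carry
            dec[i] = v % 10
            carry = v // 10
        while carry:               # extend with any remaining carry
            dec.append(carry % 10)
            carry //= 10
    return ''.join(str(d) for d in reversed(dec))
This reproduces the example above:
>>> to_decimal_string([int(c) for c in "10101010110101"], 2)
'10933'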
Here's some PHP code I've just written:
function to_base10($input, $base)
{
    $result = 0;
    $length = strlen($input);
    for ($x = $length-1; $x >= 0; $x--)
        $result += (int)$input[$x] * pow($base, ($length-1)-$x);
    return $result;
}
It's dead simple: just a loop through every char of the input string.
This works with any base < 10, but it can easily be extended to support higher bases (A -> 10, B -> 11, etc.)
edit: oh didn't see the python code :)
yeah, that's cooler
I would choose a language which more or less natively supports math representation, like Lisp. I know it seems fewer and fewer people use it, but it still has its value.
I don't know if this is large enough for your usage, but the largest integer number I could represent in my common lisp environment (CLISP) was 2^(2^20)
>> (expt 2 (expt 2 20))
In Lisp you can easily represent bin, oct, dec and hex as follows:
>> #b1010
10
>> #o12
10
>> 10
10
>> #x0A
10
You can write rationals in other bases from 2 to 36 with #nR
>> #36rABCDEFGHIJKLMNOPQRSTUVWXYZ
8337503854730415241050377135811259267835
For more information on numbers in Lisp, see the book Practical Common Lisp.