I've been learning Elixir this month, and ran into a situation where I wanted to convert a binary into a list of bits for pattern matching.
My research led me here, to an article showing a method for doing so. However, I don't fully understand one of the arguments passed to the extract function.
I could just copy and paste the code, but I'd like to understand what's going on under the hood here.
The argument is this: <<b :: size(1), bits :: bitstring>>.
What I understand
I understand that << x >> denotes a binary object x. Logically to me, it looks as though this is similar to performing: [head | tail] = list on a List, to get the first element, and then the remaining ones as a new list called tail.
What I don't understand
However, I'm not familiar with the syntax: I have never seen :: in Elixir, nor have I ever seen a binary object separated by a comma (,). I also haven't seen size(x) used in Elixir, and have never encountered a bitstring.
The Bottom Line
If someone could explain exactly how the syntax for this argument breaks down, or point me towards a resource, I would greatly appreciate it.
For your convenience, the code from that article:
defmodule Bits do
  # this is the public api which allows you to pass any binary representation
  def extract(str) when is_binary(str) do
    extract(str, [])
  end
  # this function does the heavy lifting by matching the input binary to
  # a single bit and sends the rest of the bits recursively back to itself
  defp extract(<<b :: size(1), bits :: bitstring>>, acc) when is_bitstring(bits) do
    extract(bits, [b | acc])
  end
  # this is the terminal condition when we don't have anything more to extract
  defp extract(<<>>, acc), do: acc |> Enum.reverse
end
IO.inspect Bits.extract("!!") # => [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1]
IO.inspect Bits.extract(<< 99 >>) #=> [0, 1, 1, 0, 0, 0, 1, 1]
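For readers coming from other languages, here is a rough Python sketch of what the module above computes. It is only an illustration: extract_bits is a made-up name, and Python has no bitstring pattern matching, so shifts and masks stand in for the <<b :: size(1), bits :: bitstring>> match.

```python
def extract_bits(data: bytes) -> list[int]:
    """Return the bits of `data` as 0/1 integers, most significant bit first."""
    # For every byte, walk from bit 7 (leftmost) down to bit 0 (rightmost).
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


print(extract_bits(b"!!"))        # [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1]
print(extract_bits(bytes([99])))  # [0, 1, 1, 0, 0, 0, 1, 1]
```

The outputs match the two IO.inspect lines above, which is a quick way to convince yourself that the Elixir pattern peels bits off from the left.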
Elixir pattern matching seems mind blowingly easy to use for
structured binary data.
Yep. You can thank the Erlang inventors.
According to the documentation, <<x :: size(y)>> denotes a bitstring,
whose decimal value is x and is represented by a string of bits that is
y in length.
Let's dumb it down a bit: <<x :: size(y)>> is the integer x inserted into y bits. Examples:
<<1 :: size(1)>> => 1
<<1 :: size(2)>> => 01
<<1 :: size(3)>> => 001
<<2 :: size(3)>> => 010
<<2 :: size(4)>> => 0010
The number of bits in a binary is divisible by 8, so a binary has a whole number of bytes (1 byte = 8 bits). A bitstring can have any number of bits, so every binary is also a bitstring, but a bitstring whose bit count is not divisible by 8 is not a binary. That's the difference between the binary type and the bitstring type.
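If it helps, the "integer x inserted into y bits" rule from the size(y) examples can be modelled in a few lines of Python. This is just a sketch: segment_bits is a hypothetical helper, and the masking step reflects the fact that Elixir keeps only the low y bits when x is too large to fit.

```python
def segment_bits(x: int, y: int) -> str:
    """Bit pattern of <<x :: size(y)>>: the low y bits of x, zero-padded on the left."""
    if y == 0:
        return ""
    masked = x & ((1 << y) - 1)   # keep only the low y bits, as Elixir does
    return format(masked, f"0{y}b")


for x, y in [(1, 1), (1, 2), (1, 3), (2, 3), (2, 4)]:
    print(f"<<{x} :: size({y})>> => {segment_bits(x, y)}")
```

A result whose length is a multiple of 8 corresponds to a binary; any other length is a bitstring that is not a binary.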
I understand that << x >> denotes a binary object x. Logically to me,
it looks as though this is similar to performing: [head | tail] = list
on a List, to get the first element, and then the remaining ones as a
new list called tail.
Yes:
defmodule A do
  def show_list([]), do: :ok
  def show_list([head|tail]) do
    IO.puts head
    show_list(tail)
  end
  def show_binary(<<>>), do: :ok
  def show_binary(<<char::binary-size(1), rest::binary>>) do
    IO.puts char
    show_binary(rest)
  end
end
In iex:
iex(6)> A.show_list(["a", "b", "c"])
a
b
c
:ok
iex(7)> "abc" = <<"abc">> = <<"a", "b", "c">> = <<97, 98, 99>>
"abc"
iex(9)> A.show_binary(<<97, 98, 99>>)
a
b
c
:ok
Or you can interpret the integers in the binary as plain old integers:
def show(<<>>), do: :ok
def show(<<ascii_code::integer-size(8), rest::binary>>) do
  IO.puts ascii_code
  show(rest)
end
In iex:
iex(6)> A.show(<<97, 98, 99>>)
97
98
99
:ok
The utf8 type is super useful because it will grab as many bytes as required to get a whole utf8 character:
def show(<<>>), do: :ok
def show(<<char::utf8, rest::binary>>) do
  IO.puts char
  show(rest)
end
In iex:
iex(8)> A.show("€ë")
8364
235
:ok
As you can see, the utf8 type returns the Unicode codepoint of the character. To get the character as a string/binary:
def show(<<>>), do: :ok
def show(<<codepoint::utf8, rest::binary>>) do
  IO.puts <<codepoint::utf8>>
  show(rest)
end
You take the codepoint (an integer) and use it to create the binary/string <<codepoint::utf8>>.
In iex:
iex(1)> A.show("€ë")
€
ë
:ok
You can't specify a size for the utf8 type, though, so if you want to read multiple utf8 characters, you have to specify multiple segments.
And of course, the segment rest::binary, i.e. a binary type with no size specified, is super useful. It can only appear at the end of a pattern, and rest::binary is like the greedy regex: (.*). The same goes for rest::bitstring.
Although it is easy to overlook in the Elixir docs, the total number of bits in a segment, where a segment is one of those things:
   |           |            |
   v           v            v
<< 1::size(8), 1::size(16), 1::size(1) >>
is actually unit * size, where each type has a default unit. The default type for a segment is integer, so the type for each segment above defaults to integer. An integer has a default unit of 1 bit, so the total number of bits in the first segment is: 8 * 1 bit = 8 bits. The default unit for the binary type is 8 bits, so a segment like:
<< char::binary-size(6)>>
has a total size of 6 * 8 bits = 48 bits. Equivalently, size(6) is just the number of bytes. You can specify the unit just like you can the size, e.g. <<1::integer-size(2)-unit(3)>>. The total bit size of that segment is: 2 * 3 bits = 6 bits.
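The unit * size arithmetic is simple enough to write down as a tiny Python sketch. The names here (DEFAULT_UNIT_BITS, segment_total_bits) are invented for illustration, and the table only lists the two defaults mentioned above.

```python
# Default unit, in bits, for the two segment types discussed above.
DEFAULT_UNIT_BITS = {"integer": 1, "binary": 8}


def segment_total_bits(size, type_name="integer", unit=None):
    """Total bits in a segment: size * unit, falling back to the type's default unit."""
    if unit is None:
        unit = DEFAULT_UNIT_BITS[type_name]
    return size * unit


print(segment_total_bits(8))                      # 1::size(8)
print(segment_total_bits(6, "binary"))            # char::binary-size(6)
print(segment_total_bits(2, "integer", unit=3))   # 1::integer-size(2)-unit(3)
```

The three calls correspond to the three segments worked through in the text: 8, 48 and 6 bits.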
However, I'm not familiar with the syntax
Check this out:
def bitstr2bits(bitstr) do
  for <<bit::integer-size(1) <- bitstr>>, do: bit
end
In iex:
iex(17)> A.bitstr2bits <<1::integer-size(2), 2::integer-size(2)>>
[0, 1, 1, 0]
Equivalently:
iex(3)> A.bitstr2bits(<<0b01::integer-size(2), 0b10::integer-size(2)>>)
[0, 1, 1, 0]
Elixir tends to abstract away recursion with library functions, so usually you don't have to come up with your own recursive definitions like at your link. However, that link shows one of the standard, basic recursion tricks: adding an accumulator to the function call to gather results that you want the function to return. That function could also be written like this:
def bitstr2bits(<<>>), do: []
def bitstr2bits(<<bit::integer-size(1), rest::bitstring>>) do
  [bit | bitstr2bits(rest)]
end
The accumulator function at the link is tail recursive, which means it takes up a constant (small) amount of memory, no matter how many recursive function calls are needed to step through the bitstring. A bitstring with 10 million bits, requiring 10 million recursive function calls? That would still require only a small amount of memory. The alternate definition I posted is not tail recursive, so it takes up more memory with each recursive call. In the old days, that could crash your program: if the bitstring were long enough, the memory needed would grow too large and you would get a stack overflow. However, Erlang has optimized away the disadvantages of recursive functions that are not tail recursive.
You can read about all of these here. The short answer:
:: is similar to a guard, like a when is_integer(a); in your case, size(1) expects a 1-bit value
, is a separator between the segments being matched, like | in [x | []] or like the comma in [a, b]
bitstring is a superset of binary (you can read about it here and here); any binary can be represented as a bitstring
iex> ?h
104
iex> ?e
101
iex> ?l
108
iex> ?o
111
iex> <<104, 101, 108, 108, 111>>
"hello"
iex> [104, 101, 108, 108, 111]
'hello'
but not vice versa
iex> <<1, 2, 3>>
<<1, 2, 3>>
After some research, I realized I overlooked some important information located at: elixir-lang.
According to the documentation, <<x :: size(y)>> denotes a bitstring, whose decimal value is x and is represented by a string of bits that is y in length.
Furthermore, <<binary>> always groups values, starting from the left, into bytes (8 bits each); if the number of bits is not divisible by 8, there will be x bytes, followed by a bitstring holding the leftover bits.
For example:
iex> <<3::size(5), 5::size(6)>> # <<00011, 000101>>
<<24, 5::size(3)>> # automatically shifted to: <<00011000(24) , 101>>
Now, elixir also lets us pattern match binaries, and bitstrings like so:
iex> <<3 :: size(2), b :: bitstring>> = <<61 :: size(6)>> # [11] [1101]
iex> b
<<13 :: size(4)>> # [1101]
So, I completely misunderstood binaries and bitstrings, and pattern matching between the two.
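To double-check the two iex results above outside of Elixir, here is a small Python model of the same arithmetic. It is only a sketch: pack_segments and split_leading are hypothetical helpers, and a bitstring is represented as plain integers plus a bit count.

```python
def pack_segments(segments):
    """Pack (value, size_in_bits) pairs left to right.

    Returns (whole_bytes, (tail_value, tail_bits)), where the tail holds the
    leftover bits that do not fill a complete byte.
    """
    total_value = 0
    total_bits = 0
    for value, size in segments:
        total_value = (total_value << size) | (value & ((1 << size) - 1))
        total_bits += size
    tail_bits = total_bits % 8
    tail_value = total_value & ((1 << tail_bits) - 1)
    whole = total_value >> tail_bits
    return list(whole.to_bytes(total_bits // 8, "big")), (tail_value, tail_bits)


def split_leading(value, total_bits, head_bits):
    """Split a total_bits-wide value into its first head_bits bits and the rest."""
    rest_bits = total_bits - head_bits
    return value >> rest_bits, (value & ((1 << rest_bits) - 1), rest_bits)


print(pack_segments([(3, 5), (5, 6)]))   # ([24], (5, 3))
print(split_leading(61, 6, 2))           # (3, (13, 4))
```

The first result mirrors <<24, 5::size(3)>>, and the second mirrors matching 3 :: size(2) and leaving b as <<13 :: size(4)>>.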
Not really the answer to the question stated, but I’d put it here for the sake of formatting. In Elixir we usually use a Kernel.SpecialForms.for/1 comprehension with a bitstring generator.
for << b :: size(1) <- "!!" >>, do: b
#⇒ [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1]
for << b :: size(1) <- <<99>> >>, do: b
#⇒ [0, 1, 1, 0, 0, 0, 1, 1]
I wanted to use the bits, in an 8 bit binary to toggle conditions. So
[b1, b2, ...] = extract(<<binary>>)
I then wanted to say:
if b1, do: x....
if b2, do: y...
Is there a better way to do what I'm trying to do, instead of pattern
matching?
First of all, the only terms that evaluate to false in Elixir are false and nil (just like in Ruby):
iex(18)> x = 1
1
iex(19)> y = 0
0
iex(20)> if x, do: IO.puts "I'm true."
I'm true.
:ok
iex(21)> if y, do: IO.puts "I'm true."
I'm true.
:ok
The fix is easy, though:
if b1 == 1, do: ...
Extracting the bits into a list is unnecessary because you can just iterate the bitstring:
def check_bits(<<>>), do: :ok
def check_bits(<<bit::integer-size(1), rest::bitstring>>) do
  if bit == 1, do: IO.puts "bit is on"
  check_bits(rest)
end
In other words, you can treat a bitstring similarly to a list.
Or, instead of performing the logic in the body of the function to determine whether the bit is 1, you can use pattern matching in the head of the function:
def check_bits(<<>>), do: :ok
def check_bits(<< 1::integer-size(1), rest::bitstring >>) do
  IO.puts "The bit is 1."
  check_bits(rest)
end
def check_bits(<< 0::integer-size(1), rest::bitstring >>) do
  IO.puts "The bit is 0."
  check_bits(rest)
end
Instead of using a variable, bit, for the match like here:
bit::integer-size(1)
...you use a literal value, 1:
1::integer-size(1)
The only thing that can match a literal value is the literal value itself. As a result, the pattern:
<< 1::integer-size(1), rest::bitstring >>
will only match a bitstring where the first bit, integer-size(1), is 1. The literal matching employed there is similar to doing the following with a list:
def list_contains_4([4|_tail]) do
  IO.puts "found a 4"
  true #end the recursion and return true
end
def list_contains_4([head|tail]) do
  IO.puts "#{head} is not a 4"
  list_contains_4(tail)
end
def list_contains_4([]), do: false
The first function clause tries to match the literal 4 at the head of the list. If the head of the list is not 4, there's no match; so elixir moves on to the next function clause, and in the next function clause the variable head will match anything in the list.
Using pattern matching in the head of a function rather than performing logic in the body of a function is considered more stylish and efficient in Erlang.
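For comparison, the "use the bits of a byte as on/off conditions" idea from the comment can also be done without building a list at all, by masking the bit you care about. Here is a Python sketch; bit_is_set is a made-up helper, and numbering the bits 1 to 8 from the left is an assumption chosen to mirror b1, b2, ... above.

```python
def bit_is_set(byte: int, position: int) -> bool:
    """True if the bit at `position` is 1; position 1 is the leftmost (most significant) bit."""
    if not 1 <= position <= 8:
        raise ValueError("position must be in 1..8")
    return ((byte >> (8 - position)) & 1) == 1


flags = 0b10100000
if bit_is_set(flags, 1):
    print("b1 is on")
if not bit_is_set(flags, 2):
    print("b2 is off")
```

In Python, unlike Elixir, 0 is falsy, so the explicit == 1 is not strictly required there; it is kept to mirror the if b1 == 1 fix shown earlier.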
In response to this question asking about hex to (raw) binary conversion, a comment suggested that it could be solved in "5-10 lines of C, or any other language."
I'm sure that for (some) scripting languages that could be achieved, and would like to see how. Can we prove that comment true, for C, too?
NB: this doesn't mean hex to ASCII binary - specifically the output should be a raw octet stream corresponding to the input ASCII hex. Also, the input parser should skip/ignore white space.
edit (by Brian Campbell) May I propose the following rules, for consistency? Feel free to edit or delete these if you don't think these are helpful, but I think that since there has been some discussion of how certain cases should work, some clarification would be helpful.
The program must read from stdin and write to stdout (we could also allow reading from and writing to files passed in on the command line, but I can't imagine that would be shorter in any language than stdin and stdout)
The program must use only packages included with your base, standard language distribution. In the case of C/C++, this means their respective standard libraries, and not POSIX.
The program must compile or run without any special options passed to the compiler or interpreter (so, 'gcc myprog.c' or 'python myprog.py' or 'ruby myprog.rb' are OK, while 'ruby -rscanf myprog.rb' is not allowed; requiring/importing modules counts against your character count).
The program should read integer bytes represented by pairs of adjacent hexadecimal digits (upper, lower, or mixed case), optionally separated by whitespace, and write the corresponding bytes to output. Each pair of hexadecimal digits is written with most significant nibble first.
The behavior of the program on invalid input (characters besides [0-9a-fA-F \t\r\n], spaces separating the two characters in an individual byte, an odd number of hex digits in the input) is undefined; any behavior (other than actively damaging the user's computer or something) on bad input is acceptable (throwing an error, stopping output, ignoring bad characters, treating a single character as the value of one byte, are all OK)
The program may write no additional bytes to output.
Code is scored by fewest total bytes in the source file. (Or, if we wanted to be more true to the original challenge, the score would be based on lowest number of lines of code; I would impose an 80 character limit per line in that case, since otherwise you'd get a bunch of ties for 1 line).
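To make the rules concrete, here is an ungolfed Python sketch of a program that follows them. It is not a contest entry, and hex_to_bytes and main are names picked for illustration. On invalid input it simply raises, which the rules allow.

```python
import sys


def hex_to_bytes(text: str) -> bytes:
    """Convert ASCII hex (upper, lower, or mixed case) to raw bytes, skipping whitespace."""
    digits = "".join(text.split())            # drop spaces, tabs, CR and LF
    if len(digits) % 2:
        raise ValueError("odd number of hex digits")
    # Each pair is read with the most significant nibble first.
    return bytes(int(digits[i:i + 2], 16) for i in range(0, len(digits), 2))


def main() -> None:
    # Write through the binary layer of stdout so no newline translation
    # or text encoding adds or changes bytes.
    sys.stdout.buffer.write(hex_to_bytes(sys.stdin.read()))
```

Calling main() wires the conversion to stdin and stdout; hex_to_bytes on its own is easier to test.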
edit Checkers has reduced my C solution to 46 bytes, which was then reduced to 44 bytes thanks to a tip from BillyONeal plus a bugfix on my part (no more infinite loop on bad input, now it just terminates the loop). Please give credit to Checkers for reducing this from 77 to 46 bytes:
main(i){while(scanf("%2x",&i)>0)putchar(i);}
And I have a much better Ruby solution than my last, in 38 bytes (down from 42; thanks to Joshua Swank for the regexp suggestion):
STDIN.read.scan(/\S\S/){|x|putc x.hex}
original solutions
C, in 77 bytes, or two lines of code (would be 1 if you could put the #include on the same line). Note that this has an infinite loop on bad input; the 44 byte solution with the help of Checkers and BillyONeal fixes the bug, and simply stops on bad input.
#include <stdio.h>
int main(){char c;while(scanf("%2x",&c)!=EOF)putchar(c);}
It's even just 6 lines if you format it normally:
#include <stdio.h>
int main() {
    char c;
    while (scanf("%2x",&c) != EOF)
        putchar(c);
}
Ruby, 79 bytes (I'm sure this can be improved):
STDOUT.write STDIN.read.scan(/[^\s]\s*[^\s]\s*/).map{|x|x.to_i(16)}.pack("c*")
These both take input from STDIN and write to STDOUT
39 char perl oneliner
y/A-Fa-f0-9//dc,print pack"H*",$_ for<>
Edit: wasn't really accepting uppercase, fixed.
45 byte executable (base64 encoded):
6BQAitjoDwDA4AQI2LQCitDNIevrWMOy/7QGzSF09jLkBMAa5YDkByrEJA/D
(paste into a file with a .com extension)
EDIT: Ok, here's the code. Open a Windows console, create a file with 45 bytes called 'hex.com', type "debug hex.com" then 'a' and enter. Copy and paste these lines:
db e8,14,00,8a,d8,e8,0f,00,c0,e0,04,08,d8,b4,02,8a,d0,cd,21,eb,eb,cd,20
db b2,ff,b4,06,cd,21,74,f6,32,e4,04,c0,1a,e5,80,e4,07,2a,c4,24,0f,c3
Press enter, 'w' and then enter again, 'q' and enter. You can now run 'hex.com'
EDIT2: Made it two bytes smaller!
db e8, 11, 00, 8a, d8, e8, 0c, 00, b4, 02, 02, c0, 67, 8d, 14, c3
db cd, 21, eb, ec, ba, ff, 00, b4, 06, cd, 21, 74, 0c, 04, c0, 18
db ee, 80, e6, 07, 28, f0, 24, 0f, c3, cd, 20
That was tricky. I can't believe I spent time doing that.
Brian's 77-byte C solution can be improved to 44 bytes, thanks to leniency of C with regard to function prototypes.
main(i){while(scanf("%2x",&i)>0)putchar(i);}
In Python:
binary = binascii.unhexlify(hex_str)
ONE LINE! (Yes, this is cheating.)
EDIT: This code was written a long time before the question edit which fleshed out the requirements.
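For anyone who wants to actually run the binascii approach, here is a slightly longer sketch. The one-liner needs import binascii, and binascii.unhexlify rejects whitespace, so this version strips it first. The wrapper name unhexlify_skipping_whitespace is made up for this example.

```python
import binascii


def unhexlify_skipping_whitespace(hex_str: str) -> bytes:
    """Remove all whitespace, then let binascii do the hex-to-bytes conversion."""
    compact = "".join(hex_str.split())
    return binascii.unhexlify(compact)   # raises binascii.Error on bad input


print(unhexlify_skipping_whitespace("FF a0\t42\n"))
```

Odd-length or non-hex input ends in binascii.Error, which is one of the behaviors the rules accept for bad input.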
Given that a single line of C can contain a huge number of statements, it's almost certainly true without being useful.
In C# I'd almost certainly write it in more than 10 lines, even though it would be feasible in 10. I'd separate out the "parse nybble" part from the "convert a string to a byte array" part.
Of course, if you don't care about spotting incorrect lengths etc, it becomes a bit easier. Your original text also contained spaces - should those be skipped, validated, etc? Are they part of the required input format?
I rather suspect that the comment was made without consideration as to what a pleasant, readable solution would look like.
Having said that, here's a hideous version in C#. For bonus points, it uses LINQ completely inappropriately in an effort to save a line or two of code. The lines could be longer, of course...
using System;
using System.Linq;
public class Test
{
    static void Main(string[] args)
    {
        byte[] data = ParseHex(args[0]);
        Console.WriteLine(BitConverter.ToString(data));
    }
    static byte[] ParseHex(string text)
    {
        Func<char, int> parseNybble = c => (c >= '0' && c <= '9') ? c-'0' : char.ToLower(c)-'a'+10;
        return Enumerable.Range(0, text.Length/2)
            .Select(x => (byte) ((parseNybble(text[x*2]) << 4) | parseNybble(text[x*2+1])))
            .ToArray();
    }
}
(This is avoiding "cheating" by using any built-in hex parsing code, such as Convert.ToByte(string, 16). Aside from anything else, that would mean losing the use of the word nybble, which is always a bonus.)
Perl
In, of course, one (fairly short) line:
my $bin = join '', map { chr hex } ($hex =~ /\G([0-9a-fA-F]{2})/g);
Haskell:
import Data.Char
import Numeric
import System.IO
import Foreign
main = hGetContents stdin >>=
       return.fromHexStr.filter (not.isSpace) >>=
       mapM_ (writeOneByte stdout)
fromHexStr (a:b:tl) = fromHexDgt [a,b]:fromHexStr tl
fromHexStr [] = []
fromHexDgt str = case readHex str of
  [(i,"")] -> fromIntegral (i)
  s -> error$show s
writeOneByte h i = allocaBytes 1 (wob' h i)
wob' :: Handle -> Int8 -> (Ptr Int8) -> IO ()
wob' h i ptr = poke ptr i >> hPutBuf h ptr 1
Gah.
You aren't allowed to call me on my off-the-cuff estimates! ;-P
Here's a 9 line C version with no odd formatting (well, I'll grant you that the hextonum array would be better split into 16 lines so you can see which character codes map to which values...), and only 2 shortcuts that I wouldn't deploy in anything other than a one-off script:
#include <stdio.h>
char hextonum[256] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0, 0, 0, 0, 0, 0,10,11,12,13,14,15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,10,11,12,13,14,15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
char input[81]="8b1f0008023149f60300f1f375f40c72f77508507676720c560d75f002e5ce000861130200000000";
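void main(void){
    int i = 0;
    FILE *fd = fopen("outfile.bin", "wb");
    while((input[i] != 0) && (input[i+1] != 0))
        fputc(hextonum[input[i++]] * 16 + hextonum[input[i++]], fd); /* caution: two i++ in one expression are unsequenced, which is undefined behavior in standard C */
}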
No combined lines (each statement is given its own line), it's perfectly readable, etc. An obfuscated version could undoubtedly be shorter, one could cheat and put the close braces on the same line as the preceding statement, etc, etc, etc.
The two things I don't like about it are that I don't have an fclose(fd) in there, and that main shouldn't be void and should return an int. Arguably they're not needed: the OS will release every resource the program used, the file will close without any problems, and the compiler will take care of the program exit value. Given that it's a one-time use script, it's acceptable, but don't deploy this.
It becomes eleven lines with both, so it's not a huge increase anyway, and a ten line version would include one or the other, depending on which one feels like the lesser of two evils.
It doesn't do any error checking, and it doesn't allow whitespace. Assuming, again, that it's a one-time program, it's faster to do a search/replace and get rid of spaces and other whitespace before running the script; however, it shouldn't need more than another few lines to eat whitespace as well.
There are, of course, ways to make it shorter but they would likely decrease readability significantly...
Hmph. Just read the comment about line length, so here's a newer version with an uglier hextonum macro, rather than the array:
#include <stdio.h>
#define hextonum(x) (((x)<'A')?((x)-'0'):(((x)<'a')?((x)+10-'A'):((x)+10-'a')))
char input[81]="8b1f0008023149f60300f1f375f40c72f77508507676720c560d75f002e5ce000861130200000000";
void main(void){
    int i = 0;
    FILE *fd = fopen("outfile.bin", "wb");
    for(i=0;(input[i] != 0) && (input[i+1] != 0);i+=2)
        fputc(hextonum(input[i]) * 16 + hextonum(input[i+1]), fd);
}
It isn't horribly unreadable. I know many people have issues with the ternary operator, but the appropriate naming of the macro and some analysis should readily show the average C programmer how it works. Due to side effects in the macro, I had to move to a for loop so I didn't need another line for i+=2 (the macro evaluates its argument two or three times, so hextonum(input[i++]) would increment i more than once per use; macro side effects are not for the faint of heart!).
Also, the input parser should skip/ignore white space.
grumble, grumble, grumble.
I had to add a few lines to take care of this requirement, now up to 14 lines for a reasonably formatted version. It will ignore everything that's not a hexadecimal character:
#include <stdio.h>
int hextonum[] = {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,1,2,3,4,5,6,7,8,9,-1,-1,-1,-1,-1,-1,-1,10,11,12,13,14,15,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,10,11,12,13,14,15,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};
char input[]="8b1f 0008 0231 49f6 0300 f1f3 75f4 0c72 f775 0850 7676 720c 560d 75f0 02e5 ce00 0861 1302 0000 0000";
void main(void){
    unsigned char i = 0, nibble = 1, byte = 0;
    FILE *fd = fopen("outfile.bin", "wb");
    for(i=0;input[i] != 0;i++){
        if(hextonum[input[i]] == -1)
            continue;
        byte = (byte << 4) + hextonum[input[i]];
        if((nibble ^= 0x01) == 0x01)
            fputc(byte, fd);
    }
}
I didn't bother with the 80 character line length because the input isn't even less than 80 characters, but a 3 level ternary macro could replace the first 256 entry array. If one didn't mind a bit of "alternative formatting" then the following 10 line version isn't completely unreadable:
#include <stdio.h>
int hextonum[] = {-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,0,1,2,3,4,5,6,7,8,9,-1,-1,-1,-1,-1,-1,-1,10,11,12,13,14,15,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,10,11,12,13,14,15,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1};
char input[]="8b1f 0008 0231 49f6 0300 f1f3 75f4 0c72 f775 0850 7676 720c 560d 75f0 02e5 ce00 0861 1302 0000 0000";
void main(void){
    unsigned char i = 0, nibble = 1, byte = 0;
    FILE *fd = fopen("outfile.bin", "wb");
    for(i=0;input[i] != 0;i++){
        if(hextonum[input[i]] == -1) continue;
        byte = (byte << 4) + hextonum[input[i]];
        if((nibble ^= 0x01) == 0x01) fputc(byte, fd);}}
And, again, further obfuscation and bit twiddling could result in an even shorter example.
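For readers who find the lookup table hard to follow, the whitespace-tolerant loop above boils down to "skip anything that isn't a hex digit, and emit a byte after every second digit". Here is a Python sketch of that same idea. The names are invented, and a pending variable plays the role of the nibble toggle.

```python
HEX_DIGITS = "0123456789abcdef"


def decode_ignoring_non_hex(text: str) -> bytes:
    """Decode hex digits to bytes, silently ignoring every non-hex character."""
    out = bytearray()
    pending = None                      # high nibble waiting for its partner
    for ch in text.lower():
        value = HEX_DIGITS.find(ch)     # -1 plays the role of the -1 table entries
        if value == -1:
            continue
        if pending is None:
            pending = value
        else:
            out.append((pending << 4) | value)
            pending = None
    return bytes(out)


print(decode_ignoring_non_hex("8b1f 0008 0231").hex())
```

As in the C version, a trailing unpaired digit is never written out.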
.
It's a language called "Hex!". Its only use is to read hex data from stdin and output it to stdout.
Hex! is parsed by a simple Python script.
import sys
try:
    data = open(sys.argv[1], 'r').read()
except IndexError:
    data = raw_input("hex!> ")
except Exception as e:
    print "Error occurred:",e
if data == ".":
    hex = raw_input()
    print int(hex, 16)
else:
    print "parsing error"
Fairly readable C solution (9 "real" lines):
#include <stdio.h>
int getNextHexDigit() {
    int v;
    while((v = fgetc(stdin)) < '0' && v != -1) { /* Until non-whitespace or EOF */
    }
    return v > '9' ? 9 + (v & 0x0F) : v - '0'; /* Extract number from hex digit (ASCII) */
}
int main() {
    int v;
    fputc(v = (getNextHexDigit() << 4) | getNextHexDigit(), stdout);
    return v > 0 ? main(0) : 0;
}
To support 16-bit little endian goodness, replace main with:
int main() {
    int v, q;
    v = (getNextHexDigit() << 4) | getNextHexDigit();
    fputc(q = (getNextHexDigit() << 4) | getNextHexDigit(), stdout);
    fputc(v, stdout);
    return (v | q) > 0 ? main(0) : 0;
}
A 31-character Perl solution:
s/\W//g,print(pack'H*',$_)for<>
I can't code this off the top of my head, but for every two characters, output (byte)((AsciiValueChar1-(AsciiValueChar1>64?55:48))*16+(AsciiValueChar2-(AsciiValueChar2>64?55:48))) to get a hex string changed into raw binary. This would break horribly if your input string has anything other than 0 to 9 or A to F, so I can't say how useful it would be to you.
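A quick Python sketch of that per-pair arithmetic, in case it helps. It assumes ASCII input limited to 0-9 and uppercase A-F, subtracts 48 for digits and 55 for letters, and uses made-up function names.

```python
def nibble(ch: str) -> int:
    """Value of one hex digit; only valid for '0'-'9' and uppercase 'A'-'F'."""
    code = ord(ch)
    # '0' is 48, so digits map to 0..9; 'A' is 65, so letters map to 10..15.
    return code - (55 if code > 64 else 48)


def byte_from_pair(c1: str, c2: str) -> int:
    """Combine two hex digits into one byte value, high nibble first."""
    return nibble(c1) * 16 + nibble(c2)


print(byte_from_pair("F", "F"), byte_from_pair("4", "2"))   # 255 66
```

Lowercase letters would come out wrong with these constants, which is the kind of breakage the caveat above is warning about.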
I know Jon posted a (cleaner) LINQ solution already. But for once I am able to use a LINQ statement which modifies a string during its execution and abuses LINQ's deferred evaluation without getting yelled at by my co-workers. :p
string hex = "FFA042";
byte[] bytes =
    hex.ToCharArray()
       .Select(c => ('0' <= c && c <= '9') ?
                    c - '0' :
                    10 + (('a' <= c) ? c - 'a' : c - 'A'))
       .Select(c => (hex = hex.Remove(0, 1)).Length > 0 ? (new int[] {
           c,
           hex.ToCharArray()
              .Select(c2 => ('0' <= c2 && c2 <= '9') ?
                            c2 - '0' :
                            10 + (('a' <= c2) ? c2 - 'a' : c2 - 'A'))
              .FirstOrDefault() }) : ( new int[] { c } ) )
       .Where(c => (hex.Length % 2) == 1)
       .Select(ca => ((byte)((ca[0] << 4) + ca[1]))).ToArray();
1 statement formatted for readability.
Update
Support for spaces and an odd number of hex digits (89A is equal to 08 9A)
byte[] bytes =
    hex.ToCharArray()
       .Where(c => c != ' ')
       .Reverse()
       .Select(c => (char)(c | 32) % 39 - 9)
       .Select(c =>
           (hex =
               new string('0',
                   (2 + (hex.Replace(" ", "").Length % 2)) *
                   hex.Replace(" ", "")[0].CompareTo('0')
                      .CompareTo(0)) +
               hex.Replace(" ", "").Remove(hex.Replace(" ", "").Length - 1))
           .Length > 0 ? (new int[] {
               hex.ToCharArray()
                  .Reverse()
                  .Select(c2 => (char)(c2 | 32) % 39 - 9)
                  .FirstOrDefault(), c }) : new int[] { 0, c } )
       .Where(c => (hex.Length % 2) == 1)
       .Select(ca => ((byte)((ca[0] << 4) + ca[1])))
       .Reverse().ToArray();
Still one statement. Could be made much shorter by running the replace(" ", "") on hex string in the start, but this would be a second statement.
Two interesting points with this one. The first was how to track the character count without the help of outside variables other than the source string itself. The second: while solving this I found that, for char, y.CompareTo(x) just returns "y - x", while for int, y.CompareTo(x) returns -1, 0 or 1. So for chars, y.CompareTo(x).CompareTo(0) gives a comparison which returns -1, 0 or 1.
PHP, 28 symbols:
<?=pack(I,hexdec($argv[1]));
Late to the game, but here's a Python{2,3} one-liner (100 chars, needs import sys, re):
sys.stdout.write(''.join([chr(int(x,16)) for x in re.findall(r'[A-Fa-f0-9]{2}', sys.stdin.read())]))