I'm working with some binary waveform files from various early-to-mid-'90s HP scopes. I am trying to do a bulk conversion (we have over 5,000) of the files to CSVs and then upload them into a database. I've tried hexdump, xxd, od, strings, etc., and none of them seem to work. I did hunt down a programmer's manual, but it's not making a whole lot of sense.
The files have a preamble line as ASCII text, but the data points that follow are binary, and nothing I try can decode them. The preamble gives the parameters necessary to convert the binary values into the correct measurements. It also states that the data is in WORD format.
:WAV:PRE 2,1,32768,1,+4.000000E-08,-4.9722700001108E-06,0,+2.460630E-04,+2.500000E+00,16384;:WAV:DATA #800065536^W�^W�^W�^
I'm pretty confused.
Have a look at
http://www.naic.edu/~phil/hardware/oscilloscopes/9000A_Programmer_Reference.pdf
specifically page 1-21. After ":WAV:DATA", I think the rest of the chunk above will have 65536 8-bit data bytes (the start of which is represented above by �). The ^W is probably a delimiter, so you would have to parse that out. Just a thought.
UPDATE: I'm new to oscilloscope data collection and am trying to figure the whole thing out from scratch. So, on further digging, it looks like the data you have provided shows this:
PREamble:
- WORD format (16-bit signed integers split into 2 8-bit bytes)
- If there is a WAV:BYT section, that would specify byte order for each pair
- RAW data
- 32768 data points
- COUNT = 1 (I'm not clear on the meaning of this)
- Next 3 should be X increment, origin, reference
- Next 3 should be Y increment, origin, reference, although the manual that I pointed you at above has many more fields than just these, so you might want to consult your specific scope manual.
DATA:
- On closer examination, I don't think the ^W is a delimiter; I think it is the first byte of the pair (^W is 0x17, i.e. 00010111). The � character is just the standard "I don't know how to represent this character" web substitute. You would need to look at that character as 8 bits as well.
- 65536 data bytes, i.e. 32768 two-byte pairs (matching the 32768 points in the preamble)
I'm not finding a utility that will do this for you. I think you're going to have to write or acquire some code (Perl, C, Java, Python, VB, etc.) to get this done.
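To make that concrete, here is a minimal Python sketch of such a converter. It assumes the layout worked out above (ASCII preamble, a ":WAV:DATA #" block header, big-endian 16-bit WORDs) and the preamble field order listed; treat both as assumptions to verify against your scope's programming manual.
import csv
import struct

def hp_to_csv(path, out_path, big_endian=True):
    with open(path, "rb") as f:
        raw = f.read()

    # Data block header: "#" + one digit giving the width of the byte count,
    # then the byte count itself (e.g. "#800065536" = 65536 bytes follow).
    marker = b":WAV:DATA #"
    start = raw.index(marker)
    ndig = int(raw[start + len(marker):start + len(marker) + 1])
    body = start + len(marker) + 1 + ndig
    count = int(raw[start + len(marker) + 1:body])
    data = raw[body:body + count]

    # Preamble fields, assuming the order worked out above:
    # format, type, points, count, xinc, xorg, xref, yinc, yorg, yref
    fields = raw[:start].decode("ascii").split("PRE ")[1].rstrip(";").split(",")
    xinc, xorg = float(fields[4]), float(fields[5])
    yinc, yorg, yref = float(fields[7]), float(fields[8]), float(fields[9])

    # WORD format: 16-bit signed integers. Flip big_endian if the result looks like noise.
    n = count // 2
    fmt = (">" if big_endian else "<") + str(n) + "h"
    samples = struct.unpack(fmt, data)

    with open(out_path, "w", newline="") as out:
        w = csv.writer(out)
        w.writerow(["time", "volts"])
        for i, s in enumerate(samples):
            w.writerow([xorg + i * xinc, (s - yref) * yinc + yorg])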
I need to analyse a binary data file containing raw data from a scientific instrument. A quick look in a hex viewer indicates there's probably no encryption or anything fancy: integers will probably be written as integers (but I don't know in what byte order), and who knows about floating point.
I have access to a (closed source) program that can view the contents of the file, so I can see that a certain value is 74078. What I'm not sure about is how to search for that value - do I search for 00 01 21 5E, some other byte order, etc.? (Hex Fiend doesn't support searching for decimal values.) And how would I find a floating-point number?
The software that produces these files runs on XP. I'd prefer tools that run on OSX if possible.
(Hmm, I wrote up this question, forgot to post it, then solved the problem. I guess I will write my own answer.)
In the end, Hex Fiend turned out to be just enough. What I was expecting to do:
Convert a known value into hex
Search for it
What I actually did:
Pick a random chunk of hex that looked like it might be a useful value
Tell Hex Fiend to display it as an integer, or as a float, in either little endian or big endian, until it gave a plausible-looking result (i.e., 45.000 is a lot more plausible than some huge integer)
Search for that result in the results I had from the closed source program.
Document it, go back to step 1. (Except that normally the next chunk wouldn't be 'random', but would follow sequentially.)
In this case there were really only three (binary) variables for how to interpret data:
float or integer
2 bytes or 4 bytes
little or big endian
With more variables the task would be a lot harder. It would have been nice if Hex Fiend could search for integers/floats directly, perhaps trying out the different combinations. Perhaps other hex viewers do.
And to answer one of my original questions, 74078 turned out to be stored as 5E2101. A bit more trial and error and I would have got there. :)
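If you'd rather script that brute-force search than click through the combinations by hand, a short Python sketch along these lines would do it (the file name and target value are placeholders):
import struct

def find_value(path, target, tolerance=1e-6):
    """Scan a binary file for a known decimal value at every offset,
    trying each type width and byte order."""
    data = open(path, "rb").read()
    hits = []
    for code, kind in [("h", "int16"), ("i", "int32"),
                       ("f", "float32"), ("d", "float64")]:
        size = struct.calcsize(code)
        for endian in ("<", ">"):
            for off in range(len(data) - size + 1):
                (v,) = struct.unpack_from(endian + code, data, off)
                # Allow a little tolerance when matching floats.
                if v == target or (kind.startswith("float")
                                   and abs(v - target) <= tolerance):
                    hits.append((off, kind, endian))
    return hits

# e.g. find_value("capture.bin", 74078)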
UPDATE
If I were doing this over, I'd use "Synalyze It!", a tool designed for exactly this purpose.
We were asked to find a way to compress a square binary matrix as much as possible, and if possible, to add redundancy bits to check and maybe correct errors.
The redundancy thing is easy to implement, in my opinion. The complicated part is compressing the matrix. I thought about using run-length encoding after reshaping the matrix into a vector, because there will be more zeros than ones, but I only achieved a 40-bit compression (we are working on small sizes), although I thought it would do better.
Another idea was to Huffman-code the matrix after run-length encoding, but then a dictionary must be sent along in order to recover the original information.
I'd like to know: what would be the best way to compress a binary matrix?
After reading some comments: yes @Adam, you're right, the 14x14 matrix should be compressed into 128 bits, so if I only use the coordinates (row & column) of each non-zero element, it would still take 160 bits (since there are twenty ones). I'm not looking for an exact solution but for a useful idea.
You can only talk about compressing something if you have a distribution and a representation. That's the issue of the dictionary you have to send along: you always need some sort of dictionary or protocol to decompress something. It just so happens that things like .zip and .mpeg already have those dictionaries/codecs. Even something as simple as Huffman encoding is an algorithm; on the other side of the communication channel (you can think of compression as communication), the other person already has a bit of code (the dictionary) to perform the Huffman decompression scheme.
Thus you cannot even begin to talk about compressing something without first thinking "what kinds of matrices do I expect to see?", "is the data truly random, or is there order?", and if so "how can I represent the matrices to take advantage of order in the data?".
You cannot compress some matrices without increasing the size of other objects (by at least 1 bit). This is bad news if all matrices are equally probable, and you care equally about them all.
Addenda:
The answer to use sparse-matrix machinery is not necessarily the right one. The matrix could, for example, be represented in Python as [[(r+c)%2 for c in range(cols)] for r in range(rows)] (a checkerboard pattern), and a sparse matrix wouldn't compress it at all, yet the Kolmogorov complexity of the matrix is only the length of the program above.
Well, I know every matrix will have the same number of ones, so this is kind of deterministic. The only thing I don't know is where the 1's will be. Also, if I transmit the matrix with a dictionary and there are burst errors, maybe the dictionary gets affected, so... wouldn't the resulting information be corrupted? That's why I was trying to use lossless data compression such as run-length encoding: the decoder just doesn't need a dictionary. --original poster
How many 1s does the matrix have as a fraction of its size, and what is its size (NxN -- what is N)?
Furthermore, this is an incorrect assertion and should not be used as a reason to desire run-length encoding (which still requires a program); when you transmit data over a channel, you can always add error-correction to this data. "Data" is just a blob of bits. You can transmit both the data and any required dictionaries over the channel. The error-correcting machinery does not care at all what the bits you transmit are for.
Addendum 2:
There are (14*14) choose 20 possible arrangements, which I assume are chosen uniformly at random. If this number were larger than 2^128, what you're trying to do would be impossible. Fortunately log_2((14*14) choose 20) ~= 90 bits < 128 bits, so it's possible.
The simple solution of writing down 20 numbers like 32,2,67,175,52,...,168 won't work because log_2(14*14)*20 ~= 153 bits > 128 bits. This is essentially the coordinate encoding from the question. We want to do something like it, but we are on a very strict budget and cannot afford to be "wasteful" with bits.
Because you care about each possibility equally, your "dictionary"/"program" will simulate a giant lookup table. Matlab's sparse matrix implementation may work but is not guaranteed to work and is thus not a correct solution.
If you can create a bijection between the subsets of size 20 and the number range [0, (14*14) choose 20), which fits within 128 bits, you're good to go. This corresponds to enumerating ways to descend the pyramid in http://en.wikipedia.org/wiki/Binomial_coefficient to the 20th element of row 196. This is the same as enumerating all "k-combinations". See http://en.wikipedia.org/wiki/Combination#Enumerating_k-combinations
Fortunately I know that Mathematica and Sage and other CAS software can apparently generate the "5th" or "12th" or arbitrarily numbered k-subset. Looking through their documentation, we come upon a function called "rank", e.g. http://www.sagemath.org/doc/reference/sage/combinat/subset.html
So then we do some more searching, and come across some arcane Fortran code like http://people.sc.fsu.edu/~jburkardt/m_src/subset/ksub_rank.m and http://people.sc.fsu.edu/~jburkardt/m_src/subset/ksub_unrank.m
We could reverse-engineer it, but it's kind of dense. But now we have enough information to search for "k-subset rank unrank", which leads us to http://www.site.uottawa.ca/~lucia/courses/5165-09/GenCombObj.pdf -- see the section "Generating k-subsets (of an n-set): Lexicographical Ordering" and the rank and unrank algorithms on the next few pages.
In order to achieve the exact theoretically optimal compression, in the case of a uniformly random distribution of 1s, we must thus use this technique to biject our matrices to numbers in a range below 2^128. It just so happens that combinations have a natural ordering, known as ranking and unranking of combinations: you assign a number to each combination (ranking), and if you know the number you automatically know the combination (unranking). Googling "k-subset rank unrank" will probably yield other algorithms.
Thus your solution would look like this:
serialize the matrix into a list
e.g. [[0,0,1],[0,1,1],[1,0,0]] -> [0,0,1,0,1,1,1,0,0]
take the indices of the 1s:
e.g. [0,0,1,0,1,1,1,0,0] -> [3,5,6,7]
(positions numbered 1 through 9 -- this is a k=4-subset of an n=9 set)
take the rank
e.g. compressed = rank([3,5,6,7], n=9)
compressed==412 (or something, I made that up)
you're done!
e.g. 412 -binary-> 110011100 (at most n=9bits, less than 2^n=2^9=512)
to uncompress, unrank it
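Here is a sketch of lexicographic rank/unrank in Python (0-based positions, unlike the 1-based example above; math.comb needs Python 3.8+):
from math import comb

def rank(indices, n):
    """Lexicographic rank of a sorted list of k 0-based indices."""
    r, prev, k = 0, -1, len(indices)
    for i, idx in enumerate(indices):
        # Count combinations that agree so far but pick a smaller element here.
        for j in range(prev + 1, idx):
            r += comb(n - j - 1, k - i - 1)
        prev = idx
    return r

def unrank(r, n, k):
    """Inverse of rank: recover the k indices from the number."""
    indices, j = [], 0
    for i in range(k):
        while r >= comb(n - j - 1, k - i - 1):
            r -= comb(n - j - 1, k - i - 1)
            j += 1
        indices.append(j)
        j += 1
    return indices

# For the 14x14 problem the rank fits in ceil(log2(comb(196, 20))) ~= 91 bits.
# Sanity check: unrank(rank([3, 5, 6, 7], 9), 9, 4) == [3, 5, 6, 7]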
I'll get to 128 bits in a second; first, here's how you fit a 14x14 boolean matrix with exactly 20 nonzeros into 136 bits. It's based on the CSC (Compressed Sparse Column) sparse-matrix format.
You have an array c with 14 4-bit counters that tell you how many nonzeros are in each column.
You have another array r with 20 4-bit row indices.
56 bits (c) + 80 bits (r) = 136 bits.
Let's squeeze 8 bits out of c:
Instead of 4-bit counters, use 2-bit ones. c is now 2*14 = 28 bits, but can't support more than 3 nonzeros per column. That leaves us 128-80-28 = 20 bits. Use that space for an array a4c of 5 4-bit elements, each of which "adds 4" to the element of c it names. So, if a4c={2,2,10,15,15}, that means c[2] += 4; c[2] += 4 (again); c[10] += 4 (a value like 15, which is not a valid column index in a 14-column matrix, can mark an unused slot).
The "most wasteful" distribution of nonzeros is one where the column count will require an add-4 to support 1 extra nonzero: so 5 columns with 4 nonzeros each. Luckily we have exactly 5 add-4s available.
Total space = 28 bits (c) + 20 bits (a4c) + 80 bits (r) = 128 bits.
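Here is a Python sketch of that packing; the ordering of fields inside the 128 bits is my own choice, since the answer only fixes the field sizes:
def pack_14x14(matrix):
    """Pack a 14x14 0/1 matrix with exactly 20 ones into a 128-bit int."""
    counts = [sum(col) for col in zip(*matrix)]        # nonzeros per column
    rows = [r for c in range(14) for r in range(14) if matrix[r][c]]
    a4c = []
    for c in range(14):
        while counts[c] > 3:                           # 2-bit counter overflow
            a4c.append(c)                              # record an "add 4"
            counts[c] -= 4
    a4c += [15] * (5 - len(a4c))                       # 15 marks an unused slot
    bits = 0
    for v in counts:                                   # 14 x 2 = 28 bits
        bits = (bits << 2) | v
    for v in a4c:                                      # 5 x 4 = 20 bits
        bits = (bits << 4) | v
    for v in rows:                                     # 20 x 4 = 80 bits
        bits = (bits << 4) | v
    return bits                                        # 28 + 20 + 80 = 128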
Your input is a perfect candidate for a sparse matrix. You said you're using Matlab, so you already have a good sparse matrix built for you.
spm = sparse(dense_matrix)
Matlab's sparse matrix implementation uses Compressed Sparse Columns, which has memory usage on the order of 2*(# of nonzeros) + (# of columns), which should be pretty good in your case of 20 nonzeros and 14 columns. Storing 20 values sure is better than storing 196...
Also remember that all matrices in Matlab are going to be composed of doubles. Just because your matrix can be stored as a 1-bit boolean doesn't mean Matlab won't stick it into a 64-bit floating point value... If you do need it as a boolean you're going to have to make your own type in C and use .mex files to interface with Matlab.
After thinking about this again, if all your matrices are going to be this small and they're all binary, then just store them as a binary vector (bitmask). Going off your 14x14 example, that requires 196 bits, or 25 bytes (plus n and m if your dimensions are not constant). That same vector in Matlab would use 64 bits per element, or 1568 bytes. So storing the matrix as a bitmask takes as much space as 4 elements of the original matrix in Matlab, for a compression ratio of about 62x.
Unfortunately I don't know if Matlab supports bitmasks natively or if you have to resort to .mex files. If you do get into C++ you can use STL's vector<bool> which implements a bitmask for you.
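For illustration, here is the bitmask idea sketched in Python (the answer suggests C++'s vector<bool>; a plain Python integer can serve as the 196-bit mask):
def to_bitmask(matrix):
    """Flatten a 0/1 matrix into a single integer, one bit per element."""
    mask = 0
    for row in matrix:
        for bit in row:
            mask = (mask << 1) | bit
    return mask

def from_bitmask(mask, rows, cols):
    """Rebuild the matrix; the most significant bit is element (0, 0)."""
    flat = [(mask >> i) & 1 for i in range(rows * cols - 1, -1, -1)]
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]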
Although I know the basic concepts of binary representation, I have never really written any code that uses binary arithmetic and operations.
I want to know:
- What are the basic concepts any programmer should know about binary numbers and arithmetic?
- In what "practical" ways can binary operations be used in programming? I have seen some "cool" uses of shift operators and XOR etc., but are there some typical problems where using binary operations is an obvious choice?
Please give pointers to some good reference material.
If you are developing lower-level code, it is critical that you understand the binary representation of various types. You will find this particularly useful if you are developing embedded applications or if you are dealing with low-level transmission or storage of data.
That being said, I also believe that understanding how things work at a low level is useful even if you are working at much higher levels of abstraction. I have found, for example, that my ability to develop efficient code is improved by understanding how things are represented and manipulated at a low level. I have also found such understanding useful in working with debuggers.
Here is a short-list of binary representation topics for study:
numbering systems (binary, hex, octal, decimal, ...)
binary data organization (bits, nibbles, bytes, words, ...)
binary arithmetic
other binary operations (AND, OR, XOR, NOT, SHL, SHR, ROL, ROR, ...)
type representation (boolean,integer,float,struct,...)
bit fields and packed data
Finally... here is a nice set of Bit Twiddling Hacks you might find useful.
Unless you're working with lower level stuff, or are trying to be smart, you never really get to play with binary stuff.
I've been through a computer science degree, and I've never used any of the binary arithmetic stuff we learned since my course ended.
Have a squizz here: http://www.swarthmore.edu/NatSci/echeeve1/Ref/BinaryMath/BinaryMath.html
You must understand bit masks.
Many languages and situations require the use of bit masks, for example flags in arguments or configs.
PHP has its error level which you control with bit masks:
error_reporting = E_ALL & ~E_NOTICE
Or simply checking if an int is odd or even:
isOdd = myInt & 1
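The same idioms sketched in Python (the flag names here are made up):
READ, WRITE, EXEC = 1, 2, 4    # one distinct bit per flag
perms = READ | WRITE           # combine flags with OR
if perms & WRITE:              # test a flag with AND
    print("writable")
perms &= ~READ                 # clear a flag with AND-NOT
is_odd = bool(7 & 1)           # the lowest bit tells odd from even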
I believe basic know-how on binary operations like AND, OR, XOR, and NOT would be handy, as most programming languages support these operations in the form of bitwise operators.
These operations are also used in image processing and other areas in graphics.
One important use of the XOR operation which I can think of is parity checking. Check this http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/BitOp/xor.html
cheers
The following are things I regularly appreciate knowing in my quite conventional programming work:
Know the powers of 2 up to 2^16, and know that 2^32 is about 4.3 billion. Know them well enough so that if you see the number 2147204921 pop up somewhere your first thought is "hmm, that looks pretty close to 2^31" -- that's a very effective module for your bug radar.
Be able to do simple arithmetic; e.g. convert a hexadecimal digit to a nybble and back.
Have some vague idea of how floating-point numbers are represented in binary.
Understand standard conventions that you might encounter in other people's code related to bit twiddling (flags get ORed together to make composite values and ANDed to check if one is set, shift operators pack and unpack numbers into different bytes, XOR something twice and you get the same something back, that kind of thing) -- a couple of these are illustrated below.
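For instance, in Python:
packed = (0x12 << 8) | 0x34            # shifts pack two bytes into one value
hi, lo = packed >> 8, packed & 0xFF    # ...and unpack them again
assert (hi, lo) == (0x12, 0x34)
x = 0b1010
assert (x ^ 0b1111) ^ 0b1111 == x      # XOR twice gives the original back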
Further knowledge is mostly gravy unless you work with significant performance constraints or do other less common work.
At the absolute bare minimum you should be able to implement a bit mask solution. The tasks associated with bit mask operations should ensure that you at least understand binary at a superficial level.
Off the top of my head, here are some examples of where I've used bitwise operators to do useful stuff.
A piece of JavaScript that needed one of those "check all" boxes was something along these lines:
var check = true;
for (var i = 0; i < elements.length; i++)
    check &= elements[i].checked;  // bitwise AND coerces each boolean to 0/1
checkAll.checked = check;          // and 0/1 coerces back to a boolean here
Calculate the corner points of a cube.
Vec3f m_Corners[8];

void corners(float a_Size){
    for(size_t i = 0; i < 8; i++){
        // Bits 0, 1, 2 of the corner index select the sign of x, y, z.
        m_Corners[i] = a_Size * Vec3f(axis(i, Vec3f::X), axis(i, Vec3f::Y), axis(i, Vec3f::Z));
    }
}

float axis(size_t a_Corner, int a_Axis) const{
    // Shift the corner index right and test the bit for this axis.
    return ((a_Corner >> a_Axis) & 1) == 1
        ? -.5f
        : +.5f;
}
Draw a Sierpinski triangle
for(int y = 0; y < 512; y++)
    for(int x = 0; x < 512; x++)
        if(x & y) pixels[x + y * w] = someColor;
        else      pixels[x + y * w] = someOtherColor;
Finding the next power of two
int next = 1 << (int)ceil(log(number) / log(2.0));  /* needs <math.h>; beware float rounding */
Checking if a number is a power of two
bool powerOfTwo = number && !(number & (number - 1));  /* n & (n-1) clears the lowest set bit; zero means a power of two */
The list can go on and on, but for me these are (except for the Sierpinski) everyday examples. Once you understand and work with bitwise operations, though, you'll encounter them in more and more places, such as the corners of the cube.
You don't specifically mention (nor rule out!-) floating point binary numbers and arithmetic, so I won't miss the opportunity to flog one of my favorite articles ever (seriously: I sometimes wish I could make passing a strict quiz on it a pre-req of working as a programmer...;-).
The most important thing every programmer should know about binary numbers and arithmetic is: every number in a computer is represented in some kind of binary encoding, and all arithmetic on a computer is binary arithmetic.
The consequences of this are many:
Floating point "bugs" when doing math with IEEE floating point binary numbers (Which is all numbers in javascript, and quite a few in JAVA, and C)
The upper and lower bounds of representable numbers for each type
The performance cost of multiplication/division/square root etc operations (for embedded systems
Precision loss, and accumulation errors
and more. This is stuff you need to know even if you never do a bitwise XOR, or NOT, or whatever in your life. You'll still run into these things.
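The classic demonstration of the first point, here in Python (any language using IEEE doubles behaves the same way):
# 0.1 and 0.2 have no exact binary representation, so the sum is slightly off.
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False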
This really depends on the language you're using. Recent languages such as C# and Java abstract the binary representation from you -- this makes working with binary difficult and is not usually the best way to do things anyway in these languages.
Middle and low level languages like C and C++, however, require you to understand quite a bit about how the numbers are stored underneath -- especially regarding endianness.
Binary knowledge is also useful when implementing a cross-platform protocol of some sort... for example, on x86 machines, byte order is little endian, but most network protocols want big-endian numbers. Therefore you have to realize you need to do the conversion for things to go smoothly. Many RFCs, such as this one -> https://www.rfc-editor.org/rfc/rfc4648 require binary knowledge to understand.
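For example, in Python's struct module, "!" requests network (big-endian) byte order:
import struct

value = 0xDEADBEEF
wire = struct.pack("!I", value)        # host value -> big-endian bytes
print(wire.hex())                      # deadbeef
print(struct.pack("<I", value).hex())  # efbeadde -- the little-endian layout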
In short, it's completely dependent on what you're trying to do.
Billy3
It's handy to know the numbers 256 and 65536. It's handy to know how two's complement negative numbers work.
Maybe you won't run into a lot of binary. I still use it pretty often, but maybe out of habit.
A good familiarity with bitwise operations should make you more facile with boolean algebra, and I think that's important for every programmer--you want to be able to quickly simplify complex logic expressions.
The absolute minimum is that "2" is not a binary digit, and that 10b is smaller than 3.
If you never do low-level programming (like C in embedded systems), never have to use a debugger, and never have to work with real numbers, then I suppose you could get by without knowing binary. But knowing binary will make you a stronger programmer, even if indirectly.
Once you venture into those areas you will need to know binary (and its "sister" base, hexadecimal). Without knowing it:
Embedded systems programming would be impossible.
Debugging would be hard because you wouldn't know what you were looking at in memory.
Numerical calculations with decimals would give you answers you don't understand.
I learned to twiddle bits back when C and asm were still used for "mainstream" programming. Although I no longer have much use for that knowledge, I recently used it to solve a real-world business problem.
We use a fax service that posts a message back to us when the fax has been sent, or has failed after x number of retries. The only way I had to identify the fax was a 15-character field. We wanted to consolidate this into one URL for all of our clients. Before we consolidated, all we had to fit in this field was the FaxID PK (a 32-bit int) column, which we just sent as a string.
Now we had to identify the client (a 4-character code) and the database (a 32-bit int) underneath the client. I was able to do this using base64 encoding. Without understanding the binary representation of numbers and characters, I probably would never have even thought of this solution.
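A Python sketch of that kind of packing (the exact field layout is my guess, not necessarily the author's): 4 ASCII characters plus a 32-bit integer make 8 bytes, which base64 turns into 12 characters, comfortably inside the 15-character field.
import base64
import struct

def encode_fax_id(client, db_id):
    raw = client.encode("ascii") + struct.pack(">I", db_id)  # 4 + 4 bytes
    return base64.urlsafe_b64encode(raw).decode("ascii")     # 12 characters

def decode_fax_id(token):
    raw = base64.urlsafe_b64decode(token)
    return raw[:4].decode("ascii"), struct.unpack(">I", raw[4:])[0]

# encode_fax_id("ACME", 74078) -> 'QUNNRQABIV4='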
Some useful information about number systems:
Binary      | base 2
Hexadecimal | base 16
Decimal     | base 10
Octal       | base 8
These are the most common.
Converting between them is fairly easy.
112 base 8 = (1 x 8^2) + (1 x 8^1) + (2 x 8^0) = 74 base 10
74 base 10 = (7 x 10^1) + (4 x 10^0)
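You can check conversions like these in Python:
assert int("112", 8) == 1 * 8**2 + 1 * 8**1 + 2 * 8**0 == 74
assert 0b1001010 == 0x4A == 74   # the same value written in binary and hex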
AND, OR, XOR, etc. are used in logic gates. Search for boolean algebra; it is well worth the time to know.
Say, for instance, you have 11001111 base 2 and you want to extract the last four bits only.
Truth table for AND:
P | Q | R
T | T | T
T | F | F
F | F | F
F | T | F
You can use 11001111 base 2 AND 00001111 base 2 = 00001111 base 2 (the mask keeps exactly the bits where it has a 1).
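The same extraction in Python:
value = 0b11001111
low4 = value & 0b00001111   # AND with the mask keeps only the last four bits
print(bin(low4))            # 0b1111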
There are plenty of resources on the internet.
It goes without saying that using hard-coded hex literal pointers is a disaster:
int *i = 0xDEADBEEF;
// god knows if that location is available
However, what exactly is the danger in using hex literals as variable values?
int i = 0xDEADBEEF;
// what can go wrong?
If these values are indeed "dangerous" due to their use in various debugging scenarios, then this means that even if I do not use these literals, any program that happens to stumble upon one of these values at runtime might crash.
Anyone care to explain the real dangers of using hex literals?
Edit: just to clarify, I am not referring to the general use of constants in source code. I am specifically talking about debug-scenario issues that might come up due to the use of hex values, with the specific example of 0xDEADBEEF.
There's no more danger in using a hex literal than any other kind of literal.
If your debugging session ends up executing data as code without you intending it to, you're in a world of pain anyway.
Of course, there's the normal "magic value" vs "well-named constant" code smell/cleanliness issue, but that's not really the sort of danger I think you're talking about.
With few exceptions, nothing is "constant".
We prefer to call them "slow variables" -- their value changes so slowly that we don't mind recompiling to change them.
However, we don't want to have many instances of 0x07 all through an application or a test script, where each instance has a different meaning.
We want to put a label on each constant that makes it totally unambiguous what it means.
if( x == 7 )
What does "7" mean in the above statement? Is it the same thing as
d = y / 7;
Is that the same meaning of "7"?
Test Cases are a slightly different problem. We don't need extensive, careful management of each instance of a numeric literal. Instead, we need documentation.
We can -- to an extent -- explain where "7" comes from by including a tiny bit of a hint in the code.
assertEquals( 7, someFunction(3,4), "Expected 7, see paragraph 7 of use case 7" );
A "constant" should be stated -- and named -- exactly once.
A "result" in a unit test isn't the same thing as a constant, and requires a little care in explaining where it came from.
A hex literal is no different than a decimal literal like 1. Any special significance of a value is due to the context of a particular program.
I believe the concern raised in the IP address formatting question earlier today was not related to the use of hex literals in general, but the specific use of 0xDEADBEEF. At least, that's the way I read it.
There is a concern with using 0xDEADBEEF in particular, though in my opinion it is a small one. The problem is that many debuggers and runtime systems have already co-opted this particular value as a marker value to indicate unallocated heap, bad pointers on the stack, etc.
I don't recall off the top of my head just which debugging and runtime systems use this particular value, but I have seen it used this way several times over the years. If you are debugging in one of these environments, the existence of the 0xDEADBEEF constant in your code will be indistinguishable from the values in unallocated RAM or whatever, so at best you will not have as useful RAM dumps, and at worst you will get warnings from the debugger.
Anyhow, that's what I think the original commenter meant when he told you it was bad for "use in various debugging scenarios."
There's no reason why you shouldn't assign 0xdeadbeef to a variable.
But woe betide the programmer who tries to assign decimal 3735928559, or octal 33653337357, or worst of all: binary 11011110101011011011111011101111.
Big Endian or Little Endian?
One danger is when constants are assigned to an array or structure with different sized members; the endian-ness of the compiler or machine (including JVM vs CLR) will affect the ordering of the bytes.
This issue is true of non-constant values, too, of course.
Here's an, admittedly contrived, example. What is the value of buffer[0] after the last line?
const int TEST[] = { 0x01BADA55, 0xDEADBEEF };
char buffer[BUFSZ];                          /* assume BUFSZ >= sizeof(TEST) */
memcpy( buffer, (void*)TEST, sizeof(TEST));
(On a little-endian machine buffer[0] will be 0x55; on a big-endian one, 0x01.)
I don't see any problem with using it as a value. It's just a number, after all.
There's no danger in using a hard-coded hex value for a pointer (like your first example) in the right context. In particular, when doing very low-level hardware development, this is the way you access memory-mapped registers. (Though it's best to give them names with a #define, for example.) But at the application level you shouldn't ever need to do an assignment like that.
I use 0xCAFEBABE. I haven't seen it used by any debuggers before.
int *i = 0xDEADBEEF;
// god knows if that location is available
int i = 0xDEADBEEF;
// what can go wrong?
The danger that I see is the same in both cases: you've created a flag value that has no immediate context. There's nothing about i in either case that will let me know, 100, 1,000, or 10,000 lines later, that there is a potentially critical flag value associated with it. What you've planted is a landmine bug that, if I don't remember to check for it at every possible use, could leave me facing a terrible debugging problem. Every use of i will now have to look like this:
if (i != 0xDEADBEEF) { // Curse the original designer to oblivion
// Actual useful work goes here
}
Repeat the above for all of the 7000 instances where you need to use i in your code.
Now, why is the above worse than this?
if (isIProperlyInitialized()) { // Which could just be a boolean
// Actual useful work goes here
}
At a minimum, I can spot several critical issues:
Spelling: I'm a terrible typist. How easily will you spot 0xDAEDBEEF in a code review? Or 0xDEADBEFF? On the other hand, I know that my compiler will barf immediately on isIProperlyInitialised() (insert the obligatory s vs. z debate here).
Exposure of meaning. Rather than trying to hide your flags in the code, you've intentionally created a method that the rest of the code can see.
Opportunities for coupling. It's entirely possible that a pointer or reference is connected to a loosely defined cache. An initialization check could be overloaded to check first if the value is in cache, then to try to bring it back into cache and, if all that fails, return false.
In short, it's just as easy to write the code you really need as it is to create a mysterious magic value. The code-maintainer of the future (who quite likely will be you) will thank you.