Bulk export of binary waveform data from oscilloscope to data points (CSV preferred)

I'm working with some binary waveform files from various early-to-mid-'90s HP scopes. I am trying to do a bulk conversion of the files (we have over 5,000) to CSVs and then upload them into a database. I've tried hexdump, xxd, od, strings, etc., and none of them seem to work. I did hunt down a programmer's manual, but it's not making a whole lot of sense.
The files have a preamble line as ASCII text, but the data points that follow are binary, and nothing I try can decode them. The preamble gives the parameters needed to turn the binary values into the correct measurements. It also states that the data is in WORD format.
:WAV:PRE 2,1,32768,1,+4.000000E-08,-4.9722700001108E-06,0,+2.460630E-04,+2.500000E+00,16384;:WAV:DATA #800065536^W�^W�^W�^
I'm pretty confused.

Have a look at
http://www.naic.edu/~phil/hardware/oscilloscopes/9000A_Programmer_Reference.pdf
specifically page 1-21. After ":WAV:DATA", I think the rest of the chunk above will have 65536 8-bit data bytes (the start of which is represented above by �). The ^W is probably a delimiter, so you would have to parse that out. Just a thought.
UPDATE: I'm new to oscilloscope data collection and am trying to figure the whole thing out from scratch. So, on further digging, it looks like the data you have provided shows this:
PREamble:
- WORD format (16-bit signed integers split into 2 8-bit bytes)
- If there is a WAV:BYT section, that would specify byte order for each pair
- RAW data
- 32768 data points
- COUNT = 1 (I'm not clear on the meaning of this)
- Next 3 should be X increment, origin, reference
- Next 3 should be Y increment, origin, reference, although the manual that I pointed you at above has many more fields than just these, so you might want to consult your specific scope manual.
DATA:
- On closer examination, I don't think the ^W is a delimiter; I think it is the first byte of a pair (00010111, i.e. 0x17). The � character is apparently a standard "I don't know how to represent this character" web representation. You would need to look at that character as 8 bits as well.
- 65536 bytes of data (32768 16-bit pairs)
I'm not finding a utility that will do this for you. I think you're going to have to write or acquire some code (Perl, C, Java, Python, VB, etc.) to get this done.
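For what it's worth, here is a rough Python sketch of that conversion. Everything in it is an assumption to verify against your scope's programmer's manual: the standard HP preamble field order (format, type, points, count, Xinc, Xorig, Xref, Yinc, Yorig, Yref), an IEEE-488.2 definite-length block header (the "#8" says eight length digits follow, and "00065536" is the byte count), big-endian 16-bit words, and the usual scaling formulas time = (i - Xref) * Xinc + Xorig and volts = (raw - Yref) * Yinc + Yorig.

import struct

def hp_wave_to_csv(path, out_path, big_endian=True):
    # hypothetical helper; field order and scaling are assumptions (see above)
    raw = open(path, "rb").read()

    # ASCII preamble: ":WAV:PRE" followed by comma-separated numeric fields
    idx = raw.index(b":WAV:DATA")
    fields = [float(x) for x in raw[:idx].split(b" ", 1)[1].rstrip(b";").split(b",")]
    points = int(fields[2])
    xinc, xorig, xref = fields[4], fields[5], fields[6]
    yinc, yorig, yref = fields[7], fields[8], fields[9]

    # definite-length block: '#', one digit N, N digits of byte count, then data
    blk = raw[raw.index(b"#", idx):]
    ndig = int(blk[1:2])
    nbytes = int(blk[2:2 + ndig])
    data = blk[2 + ndig:2 + ndig + nbytes]

    # WORD format: 16-bit signed integers; flip big_endian if results look wrong
    words = struct.unpack((">" if big_endian else "<") + str(nbytes // 2) + "h", data)

    with open(out_path, "w") as out:
        out.write("time,volts\n")
        for i, w in enumerate(words[:points]):
            out.write("%.9e,%.9e\n" % ((i - xref) * xinc + xorig,
                                       (w - yref) * yinc + yorig))

A shell loop (or os.walk) over the 5000 files would then finish the bulk job.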

Related

When JSON is sent over the network, how are numbers represented (as binary or text)?

This might be a trivial question... Or might not be. When I serialize an object to JSON how are numbers represented?
Specifically, I need to know how efficiently they are encoded to binary. There are 2 ways:
Transform number to its decimal string representation and then encode that string to binary.
Or encode the number directly to binary.
Which is the case?
That is a big difference: let's say the serialized object contains the number 12345678. Encoded the first way it will take 8 B to transfer; encoded the second way, only 4 B. When it comes to lots of big numbers (my case), in the first case I would be better off using base64 as a pre-processing step for serialization.
I can imagine that this might depend on the serializer (though I really hope it does not). In my case, I am using the Firebase Realtime Database SDK.
JSON is a textual notation. So the number 12345678 is sent as those eight characters, 1, 2, 3, etc. Depending on your text encoding, that's probably eight bytes (e.g., UTF-8 or Windows-1252; but if you were using UTF-16, for instance, it would be 16 bytes).
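You can check this in any language by comparing the textual size against a raw binary encoding; a quick Python illustration (the dict key "n" is just for show):

import json, struct

payload = json.dumps({"n": 12345678})
print(payload)                             # {"n": 12345678}
print(len(payload.encode("utf-8")))        # 15 bytes of text, 8 of them the digits
print(len(struct.pack("<i", 12345678)))    # 4 bytes as a raw 32-bit integer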
There have been various "binary JSON" proposals over the years, but I don't think any of them really caught on outside of specific applications (for instance, BSON in MongoDB).

Searching through very large rainbow table file

I am looking for the best way to search through a very large rainbow table file (13GB file). It is a CSV-style file, looking something like this:
1f129c42de5e4f043cbd88ff6360486f; somestring
78f640ec8bf82c0f9264c277eb714bcf; anotherstring
4ed312643e945ec4a5a1a18a7ccd6a70; yetanotherstring
... you get the idea - there are about 900 million lines, always with a hash, semicolon, clear-text string.
So basically, the program should look up whether a specific hash is listed in this file.
What's the fastest way to do this?
Obviously, I can't read the entire file into memory and then run strstr() on it.
So what's the most efficient way to do this?
read the file line by line, calling strstr() each time;
read a larger chunk of the file (e.g. 10,000 lines) and do a strstr() on it
Or would it be more efficient to import all this data into a MySQL database and then search for the hash via SQL queries?
Any help is appreciated
The best way to do it would be to sort it and then use a binary search-like algorithm on it. After sorting it, it will take around O(log n) time to find a particular entry where n is the number of entries you have. Your algorithm might look like this:
Keep a start offset and end offset. Initialize the start offset to zero and end offset to the file size.
If start = end, there is no match.
Read some data from the offset (start + end) / 2.
Skip forward until you see a newline. (You may need to read more, but if you pick an appropriate size (bigger than most of your records) to read in step 3, you probably won't have to read any more.)
If the hash you're on is the hash you're looking for, go on to step 6.
Otherwise, if the hash you're on is less than the hash you're looking for, set start to the current position and go to step 2.
If the hash you're on is greater than the hash you're looking for, set end to the current position and go to step 2.
Skip to the semicolon and trailing space. The unhashed data will be from the current position to the next newline.
This can be easily converted into a while loop with breaks.
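Here is a hedged Python sketch of that loop. The helper name and the small linear-scan window at the end are my own additions; it assumes the file has already been sorted by hash and that hashes are fixed-width lowercase hex, so byte-wise comparison matches lexical order.

import os

def find_hash(path, target_hash, window=4096):
    # sketch of the steps above; target_hash is bytes, e.g.
    # b"1f129c42de5e4f043cbd88ff6360486f"
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        lo, hi = 0, size              # a matching line, if any, starts in [lo, hi)
        while hi - lo > window:       # lo is always the start of a line
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()              # steps 3-4: skip the partial line
            line = f.readline()       # first complete line after mid
            h = line.split(b";")[0]
            if line and h == target_hash:
                return line.partition(b";")[2].strip().decode()
            if line and h < target_hash:
                lo = f.tell()         # step 6: match lies above this line
            else:
                hi = mid + 1          # step 7: match starts at or before mid
        f.seek(lo)                    # region is small now; finish linearly
        while f.tell() < hi:
            line = f.readline()
            if not line:
                return None
            h, _, plain = line.partition(b";")
            if h == target_hash:
                return plain.strip().decode()   # step 8: the unhashed data
            if h > target_hash:
                return None
    return None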
Importing it into MySQL with appropriate indices and such would use a similarly efficient algorithm (or a better one, since the data is probably packed nicely).
Your last solution might be the easiest one to implement, as you offload all of the performance optimization to the database (and databases are usually optimized for exactly that).
strstr is not useful here, as it searches for a substring anywhere; you know the specific format, so you can jump and compare in a more targeted way. Think about strncmp and strchr.
The overhead of reading a single line at a time would be really high (as is often the case with file I/O). So I'd recommend reading a larger chunk and performing your search on that chunk. I'd even think about parallelizing the search by reading the next chunk in another thread and doing the comparison there as well.
You can also think about using memory-mapped I/O instead of the standard C file API. This way you can leave loading the contents to the operating system and don't have to care about caching yourself.
Of course, restructuring the data for faster access would help you too. For example, insert padding bytes so all records are equally long. This will give you "random" access to your data stream, as you can easily calculate the position of the nth entry.
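Combining those last two suggestions in Python: memory-map the file and, assuming every record has been padded to a fixed length (REC here is a made-up value), bisect over record numbers directly:

import mmap

REC = 64  # assumed fixed record length after padding, newline included

def find_hash_mmap(path, target_hash):
    # hypothetical helper for the padding idea above
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            lo, hi = 0, len(mm) // REC
            while lo < hi:
                mid = (lo + hi) // 2
                rec = mm[mid * REC:(mid + 1) * REC]   # "random" access to record n
                h, _, plain = rec.partition(b";")
                if h == target_hash:
                    return plain.strip().decode()
                if h < target_hash:
                    lo = mid + 1
                else:
                    hi = mid
    return None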
I'd start by splitting the single large file into 65536 smaller files, so that if the hash begins with 0000 it's in the file 00/00data.txt, if the hash begins with 0001 it's in the file 00/01data.txt, etc. If the full file was 12 GiB then each of the smaller files would be (on average) 208 KiB.
Next, separate the hash from the string; such that you've got 65536 "hash files" and 65536 "string files". Each hash file would contain the remainder of the hash (the last 12 digits only, because the first 4 digits aren't needed anymore) and the offset of the string in the corresponding string file. This would mean that (instead of 65536 files at an average of 208 KiB each) you'd have 65536 hash files at maybe 120 KiB each and 65536 string files at maybe 100 KiB each.
Next, the hash files should be in a binary format. 12 hexadecimal digits cost 48 bits (not 12*8 = 96 bits). This alone would halve the size of the hash files. If the strings are aligned on a 4 byte boundary in the strings file then a 16-bit "offset of the string / 4" would be fine (as long as the string file is less than 256 KiB). Entries in the hash file should be sorted in order, and the corresponding strings file should be in the same order.
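As a sketch, packing one such entry in Python (the helper name is hypothetical; the layout is the 48-bit hash remainder plus a 16-bit offset/4, exactly as described):

def pack_entry(hash_hex, string_offset):
    # keep the 12 hex digits after the leading 4 -> 6 bytes; offset/4 fits in
    # 16 bits as long as the strings file stays under 256 KiB
    assert string_offset % 4 == 0 and string_offset // 4 < 65536
    return (int(hash_hex[4:16], 16).to_bytes(6, "big")
            + (string_offset // 4).to_bytes(2, "big"))

Each entry is then a fixed 8 bytes, so a sorted hash file can be bisected by entry number with no parsing at all.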
After all these changes, you'd use the highest 16 bits of the hash to find the right hash file, load the hash file, and do a binary search. Then (if found) you'd get the offset for the start of the string (in the strings file) from the entry in the hash file, plus the offset for the next string from the next entry in the hash file. Then you'd load data from the strings file, starting at the start of the correct string and ending at the start of the next string.
Finally, you'd implement a "hash file cache" in memory. If your application can allocate 1.5 GiB of RAM, then that'd be enough to cache half of the hash files. In this case (half the hash files cached) you'd expect that half the time the only thing you'd need to load from disk is the string itself (e.g. probably less than 20 bytes) and the other half the time you'd need to load the hash file into the cache first (e.g. 60 KiB); so on average for each lookup you'd be loading about 30 KiB from disk. Of course more memory is better (and less is worse); and if you can allocate more than about 3 GiB of RAM you can cache all of the hash files and start thinking about caching some of the strings.
A faster way would be to have a reversible encoding, so that you can convert a string into an integer and then convert the integer back into the original string without doing any sort of lookup at all. For an example; if all your strings use lower case ASCII letters and are a max. of 13 characters long, then they could all be converted into a 64-bit integer and back (as 26^13 < 2^63). This could lead to a different approach - e.g. use a reversible encoding (with bit 64 of the integer/hash clear) where possible; and only use some sort of lookup (with bit 64 of the integer/hash set) for strings that can't be encoded in a reversible way. With a little knowledge (e.g. carefully selecting the best reversible encoding for your strings) this could slash the size of your 13 GiB file down to "small enough to fit in RAM easily" and be many orders of magnitude faster.
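For instance, a lowercase-only encoding can use base 27 with digits 1-26, reserving 0 as the end-of-string marker, since 27^13 is still below 2^63. A Python sketch:

def encode(s):
    # map a lowercase ASCII string (max 13 chars) to a unique integer < 2**63
    n = 0
    for ch in reversed(s):
        n = n * 27 + (ord(ch) - ord("a") + 1)
    return n

def decode(n):
    # invert encode() without any table lookup
    out = []
    while n:
        n, d = divmod(n, 27)
        out.append(chr(d - 1 + ord("a")))
    return "".join(out)

assert decode(encode("anotherstring")) == "anotherstring"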

Tools, approaches for analysing proprietary data format?

I need to analyse a binary data file containing raw data from a scientific instrument. A quick look in a hex viewer indicates there's probably no encryption or anything fancy: integers will probably be written as integers (but I don't know in what byte order), and who knows about floating point.
I have access to a (closed-source) program that can view the contents of the file. So I can see that a certain value is 74078. What I'm not sure about is how to actually search for that value - do I search for 00 01 21 5E, some other byte order, etc.? (Hex Fiend doesn't support searching for decimal values.) And how would I find a floating point number?
The software that produces these files runs on XP. I'd prefer tools that run on OSX if possible.
(Hmm, I wrote up this question, forgot to post it, then solved the problem. I guess I will write my own answer.)
In the end, Hex Fiend turned out to be just enough. What I was expecting to do:
Convert a known value into hex
Search for it
What I actually did:
Pick a random chunk of hex that looked like it might be a useful value
Tell Hex Fiend to display it as an integer, or as a float, in either little endian or big endian, until it gave a plausible-looking result (i.e., 45.000 is a lot more plausible than some huge integer)
Search for that result in the results I had from the closed source program.
Document it, go back to step 1. (Except that normally the next chunk wouldn't be 'random', but would follow sequentially.)
In this case there were really only three (binary) variables for how to interpret data:
float or integer
2 bytes or 4 bytes
little or big endian
With more variables the task would be a lot harder. It would have been nice if Hex Fiend could search for integers/floats directly, perhaps trying out the different combinations. Perhaps other hex viewers do.
And to answer one of my original questions, 74078 turned out to be stored as 5E2101. A bit more trial and error and I would have got there. :)
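The "try all the combinations" step is easy to script if you end up doing this a lot. A small Python sketch using the struct module (the function name is mine):

import struct

def interpretations(chunk):
    # print plausible readings of the leading bytes in both byte orders
    for fmt, desc in [("<h", "int16 LE"), (">h", "int16 BE"),
                      ("<i", "int32 LE"), (">i", "int32 BE"),
                      ("<f", "float32 LE"), (">f", "float32 BE")]:
        size = struct.calcsize(fmt)
        if len(chunk) >= size:
            print(desc, struct.unpack(fmt, chunk[:size])[0])

interpretations(bytes.fromhex("5E210100"))  # "int32 LE 74078" appears in the output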
UPDATE
If I was doing this over, I'd use "Synalyze It!", a tool designed for exactly this purpose.

Emulating IBM floating point multiplication/addition in VBA

I am attempting to emulate a (no longer existing) mainframe report generator in an Access 2003 or Access 2010 environment. The data it generates must match exactly with paper reports from the early '70s. Unfortunately, the earliest years' data were run on hardware that used IBM floating point representation instead of IEEE. With the help of Google, I've found a library of VBA functions that will convert a float from decimal to the IEEE 754 32-bit binary format. I had to modify the library to accept either 32-bit or 64-bit floats, so I have a modest working knowledge of floating point formats. However, I'm having trouble making the conversion from IEEE to IBM binary format, as well as trouble multiplying and adding either the IBM or the IEEE numbers.
I haven't turned up any other libraries for performing this conversion and arithmetic operations in VBA - is there an easier way to go about this, or an existing library that I'm not finding? Failing that, a clear and straightforward explanation of the relevant algorithms?
Thanks in advance.
To be honest, you'd probably do better to start by looking at the Hercules emulator:
http://www.hercules-390.org/ Other than that, in theory with VBA you can use the Decimal type to get good results (note you have to use CDec to create these); it uses a 96-bit (12-byte) integer with a variable power-of-ten scale factor.
A quick Google shows this post from the Hercules group, which confirms Albert's point about needing to know the hardware:
---snip---
In theory, but rather less so in practice. S/360 and S/370 had a
choice of Scientific or Commercial instruction sets. The former added
the FP instructions and registers to the base; the latter the decimal
instructions, including Edit and Edit & Mark. But larger 360 (iirc /65
and up) and 370 (/155 and up) models had the union of the two, called
the Universal instruction set, and at some point the S/370 dropped the
option.
---snip---
I have to say that, having looked at the Hercules source code, you'll probably need to figure out exactly which floating point operation codes (in terms of precision: single, long, extended) are being performed.
The problem here is that you're confusing the Decimal type in Access with the Single and Double floating point types available in Access.
If you use the Currency data type in Access, this is a scaled integer, and it will not produce rounding (that is what most of us use for financial calculations and reports). You can also use Decimal values in Access, and again they don't round at all, as they are packed decimals.
However, both the Single and Double values available inside of Access are in fact the same format and conform to the IEEE floating point standard.
For an Access Single variable, this is a 32-bit number, and the range is:
-3.402823E38 to -1.401298E-45 for negative values
1.401298E-45 to 3.402823E38 for positive values
That looks to be the same to me as the IEEE 754 standard.
So, if you add up values in Access as a Single, you should get the same rounding results.
So, on Intel-based hardware, Access Singles and Doubles are, I believe, the same as this IEEE standard.
The only real issue here is: what is the format of the original data you're pulling into Access, and what kind of text, string, or conversion process occurs when that data is pulled in and stored?
Access can convert numbers. Try typing these values at the Access command line prompt (debug window):
? hex(255)
The above will show FF.
? csng(&hFF)
The above will show 255.
Edit:
Ah, OK, I see now I had this reversed; my mistake here. The problem is that, assuming you convert a number to the older IBM format (excess-64?), you will THEN have to get your hands on the code they used for adding those numbers. In fact, even back then, different IBM models produced different results depending on what you purchased (more money = more precision).
So, not only do you need conversion routines to convert to the internal representation, you THEN need the routines that add/subtract/multiply those numbers. Just having conversion routines is not going to get you very far, since you also have to duplicate their exact math routines. Those routines are likely not all created equal in terms of how they round numbers, etc.
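That said, the decode half of the problem is straightforward: IBM hexadecimal (excess-64) single precision is 1 sign bit, a 7-bit base-16 exponent stored excess-64, and a 24-bit fraction read as 0.f, giving sign * 0.f * 16^(exponent - 64). A Python sketch of just the decoding (the function name is mine, and as argued above this does nothing to reproduce the old hardware's arithmetic and rounding); the same logic ports directly to VBA:

def ibm32_to_float(b):
    # decode a 4-byte big-endian IBM System/360 single-precision float
    word = int.from_bytes(b, "big")
    sign = -1.0 if word >> 31 else 1.0
    exponent = (word >> 24) & 0x7F                   # excess-64, base 16
    fraction = (word & 0xFFFFFF) / float(1 << 24)    # 0 <= f < 1
    return sign * fraction * 16.0 ** (exponent - 64)

# 0x42640000: exponent 66, fraction 0x640000 / 2**24 = 0.390625,
# so the value is 0.390625 * 16**2 = 100.0
assert ibm32_to_float(bytes.fromhex("42640000")) == 100.0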

How can you reverse engineer a binary thrift file?

I've been asked to process some files serialized as binary (not text/JSON unfortunately) Thrift objects, but I don't have access to the program or programmer that created the files, so I have no idea of their structure, field order, etc. Is there a way using the Thrift libraries to open a binary file and analyze it, getting a list of the field types, values, nesting, etc.?
Unfortunately, it appears that Thrift's binary protocol does not do very much tagging of data at all; to decode it, the library appears to assume you have the .thrift file in hand so you know that, say, the next 4 bytes are supposed to be an integer and aren't actually the first half of a float. So it appears you are stuck with, basically, looking at the files in a hex editor (or equivalent) and trying to deduce fields based on the exact patterns you're seeing.
There are a very few helpful bits:
- Each file begins with a version, protocol identifier string, and sequence number.
- Maps will begin with 6 bytes that identify the key and value types (the first two bytes, as integer codes) plus the number of elements as a 4-byte integer.
- The type codes appear to be standard (the canonical location of their definitions seems to be TProtocol.h in the Thrift sources; for instance, a boolean value is specified by type code 2, a UTF-8 string by type code 16, and so on).
- Strings are prefixed by a 4-byte integer length field, and lists are prefixed by the element type (1 byte) and a 4-byte length.
- It looks like all integer fields are saved big-endian, and floating points are saved in IEEE format (which should make doubles relatively easy to find, at least).
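Based only on those observations, you can sketch a crude recursive dumper and point it at a capture. Everything below (the names, and the exact type-code values, which I took from the standard TBinaryProtocol definitions) should be double-checked against your Thrift version:

import struct

# standard binary-protocol type codes (see TProtocol.h)
T_STOP, T_BOOL, T_BYTE, T_DOUBLE = 0, 2, 3, 4
T_I16, T_I32, T_I64, T_STRING = 6, 8, 10, 11
T_STRUCT, T_MAP, T_SET, T_LIST = 12, 13, 14, 15

def dump(buf, pos, ftype, depth=0):
    # decode one value at buf[pos:], print it, and return the new position
    pad = "  " * depth
    if ftype in (T_BOOL, T_BYTE):
        print(pad, buf[pos]); return pos + 1
    if ftype == T_DOUBLE:
        print(pad, struct.unpack(">d", buf[pos:pos+8])[0]); return pos + 8
    if ftype == T_I16:
        print(pad, struct.unpack(">h", buf[pos:pos+2])[0]); return pos + 2
    if ftype == T_I32:
        print(pad, struct.unpack(">i", buf[pos:pos+4])[0]); return pos + 4
    if ftype == T_I64:
        print(pad, struct.unpack(">q", buf[pos:pos+8])[0]); return pos + 8
    if ftype == T_STRING:                    # 4-byte length prefix, then bytes
        n = struct.unpack(">i", buf[pos:pos+4])[0]
        print(pad, buf[pos+4:pos+4+n]); return pos + 4 + n
    if ftype == T_STRUCT:                    # field headers until T_STOP
        while buf[pos] != T_STOP:
            ft = buf[pos]
            fid = struct.unpack(">h", buf[pos+1:pos+3])[0]
            print(pad, "field", fid, "type", ft)
            pos = dump(buf, pos + 3, ft, depth + 1)
        return pos + 1
    if ftype in (T_LIST, T_SET):             # element type, then 4-byte count
        et = buf[pos]
        n = struct.unpack(">i", buf[pos+1:pos+5])[0]
        pos += 5
        for _ in range(n):
            pos = dump(buf, pos, et, depth + 1)
        return pos
    if ftype == T_MAP:                       # key type, value type, 4-byte count
        kt, vt = buf[pos], buf[pos+1]
        n = struct.unpack(">i", buf[pos+2:pos+6])[0]
        pos += 6
        for _ in range(n):
            pos = dump(buf, pos, kt, depth + 1)
            pos = dump(buf, pos, vt, depth + 1)
        return pos
    raise ValueError("unhandled type code %d at offset %d" % (ftype, pos))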
The TBinaryProtocol* files in Thrift have a few more helpful details; on the plus side, there are a number of different implementations so you can read the ones implemented in the language you are most comfortable with.
Sorry, I know this probably isn't that helpful, but it really does appear that this is all the information the Thrift binary format provides; clearly the binary format was designed with the intent that you would always know the exact protocol spec already, and the goal was to minimize wire space rather than to make it at all easy to decode blindly.