I am trying to parse Well Known Binary (WKB), a binary encoding of geometry objects used in Geographic Information Systems (GIS). I am using this spec from ESRI (same results here from ESRI). My input data comes from Osmosis, a tool to parse OpenStreetMap data, specifically the pgsimp-dump format, which gives the hex representation of the binary.
The ESRI docs say that a Point should be only 21 bytes: 1 byte for byte order, 4 for the uint32 type id, 8 for the double x and 8 for the double y.
An example from Osmosis (in hex) is 0101000020E6100000DB81DF2B5F7822C0DFBB7262B4744A40, which is 25 bytes long.
Shapely, a Python library for parsing WKB (etc.) that is based on the popular C library GEOS, is able to parse this string:
>>> import shapely.wkb
>>> shapely.wkb.loads("0101000020E6100000DB81DF2B5F7822C0DFBB7262B4744A40", hex=True)
<shapely.geometry.point.Point object at 0x7f221f2581d0>
When I ask Shapely to parse this and then convert it back to WKB, I get 21 bytes:
>>> shapely.wkb.loads("0101000020E6100000DB81DF2B5F7822C0DFBB7262B4744A40", hex=True).wkb.encode("hex").upper()
'0101000000DB81DF2B5F7822C0DFBB7262B4744A40'
The difference is the 4 bytes in the middle, which appear 3 bytes into the uint32 for the type id:
01010000**20E61000**00DB81DF2B5F7822C0DFBB7262B4744A40
Why can shapely/geos parse this WKB when it's invalid WKB? What do these bytes mean?
GEOS / Shapely use an Extended variant of WKT/WKB called EWKT / EWKB, which is documented by PostGIS. If you have access to PostGIS, you can see what's going on here:
SELECT ST_AsEWKT('0101000020E6100000DB81DF2B5F7822C0DFBB7262B4744A40'::geometry);
Returns the EWKT SRID=4326;POINT(-9.2351011 52.9117549). So the extra data was the spatial reference identifier, or SRID. Specifically EPSG:4326 for WGS 84.
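If you don't have PostGIS at hand, the same header can be picked apart by hand. Below is a minimal sketch in Python using only the standard library; it assumes a 2D point and the EWKB convention that bit 0x20000000 of the type field means "an SRID follows". The function name decode_ewkb_point is made up for this illustration.

```python
import struct

SRID_FLAG = 0x20000000  # EWKB convention: set when a 4-byte SRID follows the type field


def decode_ewkb_point(hex_string):
    """Decode a 2D point from WKB or EWKB hex. Returns (srid_or_None, x, y)."""
    data = bytes.fromhex(hex_string)
    endian = "<" if data[0] == 1 else ">"  # 1 = little-endian, 0 = big-endian
    (type_field,) = struct.unpack_from(endian + "I", data, 1)
    offset = 5
    srid = None
    if type_field & SRID_FLAG:
        (srid,) = struct.unpack_from(endian + "I", data, offset)
        offset += 4
    if (type_field & ~SRID_FLAG) != 1:  # 1 = Point; Z/M variants are not handled here
        raise ValueError("only 2D points are handled in this sketch")
    x, y = struct.unpack_from(endian + "dd", data, offset)
    return srid, x, y


srid, x, y = decode_ewkb_point("0101000020E6100000DB81DF2B5F7822C0DFBB7262B4744A40")
print(srid, x, y)
```

The type field 01000020 read as a little-endian uint32 is 0x20000001: point type 1 plus the SRID flag. The next four bytes, E6100000, are 4326.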
Shapely does not support SRIDs; however, there are a few hacks, e.g.:
from shapely import geos
geos.WKBWriter.defaults['include_srid'] = True
should now make wkb or wkb_hex output the EWKB, which includes the SRID. The default is False, which would output ISO WKB for 2D geometries (but not for 3D).
So it seems your objective is to convert EWKB to ISO WKB, which you can do with GEOS / Shapely for 2D geometries only. If you have 3D (Z or M) or 4D (ZM) geometries, then only PostGIS is able to do this conversion.
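For the simple 2D case, the conversion can also be sketched without any geometry library: clear the SRID flag in the type field and drop the four SRID bytes. This is only an illustration under the assumption that the SRID appears once, on the outermost header; strip_srid is a hypothetical helper, not part of Shapely or GEOS.

```python
import struct

SRID_FLAG = 0x20000000  # EWKB "has SRID" bit in the type field


def strip_srid(ewkb_hex):
    """Return WKB hex with the top-level SRID removed (2D geometries only)."""
    data = bytes.fromhex(ewkb_hex)
    endian = "<" if data[0] == 1 else ">"
    (type_field,) = struct.unpack_from(endian + "I", data, 1)
    if not type_field & SRID_FLAG:
        return ewkb_hex.upper()  # nothing to strip
    new_type = struct.pack(endian + "I", type_field & ~SRID_FLAG)
    # byte 0: byte order, bytes 1-4: type, bytes 5-8: SRID (dropped), rest: coordinates
    return (data[:1] + new_type + data[9:]).hex().upper()


plain = strip_srid("0101000020E6100000DB81DF2B5F7822C0DFBB7262B4744A40")
print(plain)
```

The result is the same 21-byte string that Shapely printed in the question.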
Related
I am new to dynamo db binary data. I have a hash key + range key(both are byte[]). Now I am trying to get a list of items by querying on range key(ex: le, ge or between). I am able to do put and get operations fine.
However I am getting errors while doing this. My question is can dynamodb do this comparison? I am passing a byte[]. Can dynamodb check if existing rangekey(byte[]) is lesser or greater than this?
Yes, DynamoDB does support the byte array type well, and also allows comparison between them in conditions, done lexicographically, so what you want to do should and does work.
You didn't say which "errors" you are getting. You should be aware that DynamoDB treats the bytes of the byte arrays as unsigned bytes. For example, the byte 128 comes after byte 127. I don't know which language you are using to test this, but some languages have signed bytes, meaning that the byte 128 (0x80) is treated as -128 and will come before, not after, byte 127 in the sort order. DynamoDB doesn't do that, because it uses unsigned bytes.
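To make the signed/unsigned difference concrete, here is a small Python sketch. It does not call DynamoDB at all; it only relies on the fact that Python compares bytes objects lexicographically as unsigned values, which is the ordering described above. The signed() helper is made up to mimic a language with signed bytes.

```python
def signed(raw):
    """View each byte as a signed 8-bit value, as a signed-byte language would."""
    return [b - 256 if b > 127 else b for b in raw]


low, high = bytes([127]), bytes([128])

unsigned_order = low < high                  # unsigned, lexicographic comparison
signed_order = signed(low) < signed(high)    # same bytes, signed interpretation

print(unsigned_order, signed_order, signed(high))
```

If your client-side expectations come from signed comparisons, range conditions such as le, ge or between will appear to return the "wrong" items for keys containing bytes above 127.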
I've recently heard of PSON, and I hear that it's similar to JSON, but that it is different in how the objects are encoded. But how are they different? More specifically, how are they different when used for serializing and deserializing data?
PSON does not differ from JSON in its representation of objects, arrays, numbers, booleans, and null values. PSON does serialize strings differently from JSON.
A PSON string is a sequence of 8-bit ASCII encoded data. It must start and end with " (ASCII 0x22) characters. Between these characters it may contain any byte sequence.
PSON combines the best of JSON, BJSON, ProtoBuf and a bit of ZIP to achieve a superior small footprint on the network level. Basic constants and small integer values are efficiently encoded as a single byte. Other integer values are always encoded as variable length integers. Additionally it comes with progressive and static dictionaries to reduce data redundancy to a minimum.
246 single byte values
Base 128 variable length integers (varints) as in protobuf
32 bit floats instead of 64 bit doubles if possible without information loss
Progressive and static dictionaries
Raw binary data support
Long support
JSON, by contrast, requires that the serialized form be valid Unicode (usually UTF-8).
You can decode PSON with JSON parsers.
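The "base 128 variable length integers (varints) as in protobuf" item in the list above is the easiest one to show in code. The sketch below implements the plain protobuf-style unsigned varint in Python; how PSON frames such values on the wire (type bytes, signed values, dictionaries) is not covered here, so treat it as an illustration of the encoding idea only.

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint (7 payload bits per byte)."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data):
    """Decode one varint from the start of data. Returns (value, bytes_consumed)."""
    result = 0
    shift = 0
    for i, byte in enumerate(data):
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, i + 1
        shift += 7
    raise ValueError("truncated varint")


print(encode_varint(300).hex(), len(encode_varint(12345678)))
```

Small values take one byte, and a number like 12345678 takes four bytes instead of the eight characters its decimal text form needs.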
This might be a trivial question... Or might not be. When I serialize an object to JSON how are numbers represented?
Specifically, I need to know how efficiently they are encoded to binary. There are 2 ways:
Transform number to its decimal string representation and then encode that string to binary.
Or encode the number directly to binary.
Which is the case?
That is a big difference: let's say the serialized object contains the number 12345678. Encoded the first way it will take 8 B to transfer, encoded the second way only 4 B. When it comes to lots of big numbers (my case), then in the first case I would be better off using base64 as a pre-process for serialization.
I can imagine that this might be dependent on serializer (though I really hope it is not). In that case, I am using Firebase Realtime database SDK.
JSON is a textual notation. So the number 12345678 is sent as those eight characters, 1, 2, 3, etc. Depending on your text encoding, that's probably eight bytes (e.g., UTF-8 or Windows-1252; but if you were using UTF-16, for instance, it would be 16 bytes).
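The sizes are easy to check for yourself. Here is a quick sketch in Python with the standard json and struct modules; it says nothing about what the Firebase SDK does internally, it only compares the JSON text form with a raw 32-bit integer.

```python
import json
import struct

text = json.dumps(12345678)                    # the JSON form is just the decimal digits

utf8_size = len(text.encode("utf-8"))          # one byte per digit
utf16_size = len(text.encode("utf-16-le"))     # two bytes per digit (no BOM)
binary_size = len(struct.pack("<i", 12345678)) # raw 32-bit integer

print(text, utf8_size, utf16_size, binary_size)
```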
There have been various "binary JSON" proposals over the years, but I don't think any of them really caught on outside of specific applications (for instance, BSON in MongoDB).
I'm working with some binary waveform files from various early to mid-90s HP scopes. I am trying to do a bulk conversion (we have over 5000) of the files to CSVs and then upload them into a database. I've tried hexdump, xxd, od, strings, etc. and none of them seem to work. I did hunt down a programmer's manual but it's not making a whole lot of sense.
The files have a preamble line as ascii text but then the data points are in binary and for some reason nothing I try can decode them. The preamble gives the data necessary to use the binary values and calculate the correct values. It also states that the data is in WORD format.
:WAV:PRE 2,1,32768,1,+4.000000E-08,-4.9722700001108E-06,0,+2.460630E-04,+2.500000E+00,16384;:WAV:DATA #800065536^W�^W�^W�^
I'm pretty confused.
Have a look at
http://www.naic.edu/~phil/hardware/oscilloscopes/9000A_Programmer_Reference.pdf
specifically page 1-21. After ":WAV:DATA", I think the rest of the chunk above will have 65536 8-bit data bytes (the start of which is represented above by �) . The ^W is probably a delimiter, so you would have to parse that out. Just a thought.
UPDATE: I'm new to oscilloscope data collection and am trying to figure the whole thing out from scratch. So, on further digging, it looks like the data you have provided shows this:
PREamble:
- WORD format (16-bit signed integers split into 2 8-bit bytes)
- If there is a WAV:BYT section, that would specify byte order for each pair
- RAW data
- 32768 data points
- COUNT = 1 (I'm not clear on the meaning of this)
- Next 3 should be X increment, origin, reference
- Next 3 should be Y increment, origin, reference, although the manual that I pointed you at above has many more fields than just these, so you might want to consult your specific scope manual.
DATA:
- On closer examination, I don't think the ^W is a delimiter; I think it is the first byte of the pair (00010111, i.e. 0x17). The � character is apparently a standard "I don't know how to represent this character" web representation. You would need to look at that character as 8 bits also.
- 65536 bytes of data, i.e. 32768 byte pairs, which matches the 32768 data points in the preamble
I'm not finding a utility that will do this for you. I think you're going to have to write or acquire some code (Perl, C, Java, Python, VB, etc.) to get this done.
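As a starting point, here is a rough Python sketch of what such code could look like. It rests on several assumptions you should check against your scope's manual: the preamble fields are in the order listed above, the words are big-endian (MSB first), and the conversions are time = (index - xreference) * xincrement + xorigin and voltage = (value - yreference) * yincrement + yorigin. The function name parse_hp_waveform is made up.

```python
import struct


def parse_hp_waveform(raw):
    """Return a list of (time, voltage) pairs from one ':WAV:PRE ...;:WAV:DATA #...' record."""
    pre_tag, data_tag = b":WAV:PRE ", b";:WAV:DATA #"
    pre_start = raw.index(pre_tag) + len(pre_tag)
    pre_end = raw.index(data_tag, pre_start)
    fields = raw[pre_start:pre_end].decode("ascii").split(",")
    points = int(fields[2])
    x_inc, x_org, x_ref = (float(f) for f in fields[4:7])
    y_inc, y_org, y_ref = (float(f) for f in fields[7:10])

    # Block header: '#', one digit giving the number of length digits, then the byte count.
    pos = pre_end + len(data_tag)
    ndigits = int(raw[pos:pos + 1])
    nbytes = int(raw[pos + 1:pos + 1 + ndigits])
    start = pos + 1 + ndigits
    payload = raw[start:start + nbytes]
    if len(payload) != nbytes or nbytes != 2 * points:
        raise ValueError("data block does not match the preamble")

    words = struct.unpack(">%dh" % points, payload)  # assumption: big-endian signed 16-bit
    return [((i - x_ref) * x_inc + x_org, (w - y_ref) * y_inc + y_org)
            for i, w in enumerate(words)]
```

Reading each file in binary mode, passing the bytes to this function and writing the pairs out with the csv module would cover the bulk conversion. If the voltages look wrong, try "<" instead of ">" in the unpack format to swap the byte order.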
I am pulling data from a Tektronix oscilloscope in Tektronix' RIBinary format using a TCL script, and then within the script I need to convert that to a decimal value.
I have done very little with binary conversions in the first place, but to add to my frustration the documentation on this binary format is also very vague in my opinion. Anyway, here's my current code:
proc ::Scope::CaptureWaveform {VisaAlias Channel} {
# Apply scope settings
::VISA::Write $VisaAlias "*WAI"
::VISA::Write $VisaAlias "DATa:STARt 1"
::VISA::Write $VisaAlias "DATa:STOP 4000"
::VISA::Write $VisaAlias "DATa:ENCdg RIBinary"
::VISA::Write $VisaAlias "DATa:SOUrce $Channel"
# Download waveform
set RIBinaryWaveform [::VISA::Query $VisaAlias "CURVe?"]
# Parse out leading label from scope output
set RIBinaryWaveform [string range $RIBinaryWaveform 11 end]
# Convert binary data to a binary string usable by TCL
binary scan $RIBinaryWaveform "I*" TCLBinaryWaveform
set TCLBinaryWaveform
# Convert binary data to list
}
Now, this code pulls the following data from the machine:
-1064723993 -486674282 50109321 -6337556 70678 8459972 143470359 1046714383 1082560884 1042711231 1074910212 1057300801 1061457453 1079313832 1066305613 1059935120 1068139252 1066053580 1065228329 1062213553
And this is what the machine pulls when I just take regular ASCII data (i.e. what the above data should look like after the conversion):
-1064723968 -486674272 50109320 -6337556 70678 8459972 143470352 1046714368 1082560896 1042711232 1074910208 1057300800 1061457472 1079313792 1066305600 1059935104 1068139264 1066053568 1065228352 1062213568
Finally, here is a reference to the RIBinary specification from Tektronix since I don't think it is a standard data type:
http://www.tek.com/support/faqs/how-binary-data-represented-tektronix-oscilloscopes
I've been looking for a while now on the Tektronix website for more information on converting the data and the above URL is all I've been able to find, but I'll comment or edit this post if I find any more information that might be useful.
Updates
Answers don't necessarily have to be in TCL. If anyone can help me logically work through this on a high level I can hash out the TCL details (this I think would be more helpful to others as well)
The reason I need to transfer the data in binary and then convert it afterwards is for the purpose of optimization. Due to this I can't have the device perform the conversion before the transfer as it will slow down the process.
I updated my code some and now my results are maddeningly close to the actual results. I assume it may have something to do with the commas that are in the data originally.
Below are now examples of the raw data sent from the device without any of my parsing.
On suggestion from @kostix, I made a second script with code he gave me that I modified to fit my data set. It can be seen below; however, the results are exactly the same as with my code above.
ASCIi:
:CURVE -1064723968,-486674272,50109320,-6337556,70678,8459972,143470352,1046714368,1082560896,1042711232,1074910208,1057300800,1061457472,1079313792,1066305600,1059935104,1068139264,1066053568,1065228352,1062213568
RIBinary:
:CURVE #280ÀçâýðüÿKì
Note on RIBinary - ":CURVE #280" is all part of the header that I need to parse out, but the #280 part of it can vary depending on the data I'm collecting. Here's some more info from Tektronix on what the #280 means:
<Block> is the waveform data in binary format. The waveform is formatted
as: #<x><yyy><data><newline> where <x> is the number of y bytes. For
example, if <yyy> = 500, then <x> = 3. <yyy> is the number of bytes to
transfer including checksum.
So, for my current data set x = 2 and yyy = 80. I am just really unfamiliar with converting binary data, so I'm not sure what to do programmatically to deal with the block format.
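For what it's worth, the header arithmetic itself is small. Here is a hedged sketch in Python (not Tcl) of how the x and yyy fields can be read; parse_block_header is a made-up name, and the 80 zero bytes stand in for real scope data.

```python
def parse_block_header(raw):
    """Return (payload_offset, payload_length) for a '#<x><yyy>' block header."""
    start = raw.index(b"#")
    x = int(raw[start + 1:start + 2])          # how many digits the length field has
    yyy = int(raw[start + 2:start + 2 + x])    # number of data bytes that follow
    return start + 2 + x, yyy


fake_reply = b":CURVE #280" + bytes(80)        # placeholder payload, not real samples
offset, length = parse_block_header(fake_reply)
print(offset, length, length // 4)
```

With ":CURVE #280" the payload starts at offset 11 and is 80 bytes long, i.e. 20 four-byte values, which matches the 20 numbers in the ASCII reply.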
On suggestion from @kostix I made a second script with code he gave me that I modified to fit my data set:
set RIBinaryWaveform [::VISA::Query ${VisaAlias} "CURVe?"]
binary scan $RIBinaryWaveform a8a curv nbytes
encoding convertfrom ascii ${curv}
scan $nbytes %u n
set n
set headerlen [expr {$n + 9}]
binary scan $RIBinaryWaveform @9a$n nbytes
scan $nbytes %u n
set n
set numints [expr {$n / 4}]
binary scan $RIBinaryWaveform @${headerlen}I${numints} data
set data
The output of this code is the same as the code I provided above.
According to the documentation you link to, RIBinary is signed big-endian. Thus, you convert the binary data to integers with binary scan $data "I*" someVar (I* means “as many big-endian 4-byte integers as you can”). You use the same conversion with RPBinary (if you've got that) but you then need to chop each value to the positive 32-bit integer range by doing & 0xFFFFFFFF (assuming at least Tcl 8.5). For FPBinary, use R* (requires 8.5). SRIBinary, SRPBinary and SFPBinary are the little-endian versions, for which you use lower-case format characters.
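For readers working outside Tcl, the same mapping can be sketched with Python's struct module. The table below assumes 4-byte samples, as in the question; the encoding names come from the Tektronix FAQ, while the dictionary and the decode_samples function are made up for this illustration.

```python
import struct

# Encoding name -> (byte order, struct type code), assuming 4 bytes per sample.
FORMATS = {
    "RIBinary": (">", "i"),   # big-endian signed integer
    "RPBinary": (">", "I"),   # big-endian positive (unsigned) integer
    "FPBinary": (">", "f"),   # big-endian 32-bit float
    "SRIBinary": ("<", "i"),  # little-endian variants
    "SRPBinary": ("<", "I"),
    "SFPBinary": ("<", "f"),
}


def decode_samples(payload, encoding="RIBinary"):
    """Unpack as many whole 4-byte samples as the payload contains."""
    order, code = FORMATS[encoding]
    count = len(payload) // 4
    return list(struct.unpack("%s%d%s" % (order, count, code), payload[:count * 4]))


payload = struct.pack(">3i", -1064723993, -486674282, 50109321)
print(decode_samples(payload))
```

Decoding the same bytes as "RPBinary" gives the unsigned view, which is what the & 0xFFFFFFFF step does in Tcl.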
Getting conversions correct can take some experimentation.
I have no experience with this stuff but like googling. Here are my findings.
This document, in the section titled "Formatted I/O Operations" tells that the viQueryf() standard C API function combines viPrintf() (writing to a device) with viScanf() (reading from a device), and examples include calls like viQueryf (io, ":CURV?\n", "%#b", &totalPoints, rdBuffer); (see the section «IEEE-488.2 Binary Data—"%b"»), where the third argument to the function specifies the desired format.
The VISA::Query procedure from your Tcl library pretty much resembles that viQueryf() in my eyes, so I'd expect it to accept the third (optional) argument which specifies the format you want the data to be in.
If there's nothing like it, let's look at your ASCII data. Your FAQ entry and the document I found both specify that the opaque data might come in the form of a series of integers of different size and endianness. The "RIBinary" format states it should be big-endian signed integers.
The binary scan Tcl command is able to scan 16-bit and 32-bit big-endian integers from a byte stream — use the S* and I* formats, correspondingly.
Your ASCII data clearly looks like 32-bit integers, so I'd try scanning using I*.
Also see this doc — it appears to have much in common with the PDF guide I linked above, but might be handy anyway.
TL;DR
Try studying your API to find a way to explicitly tell the device the data format you want. This might produce a more robust solution in case the device is somehow reconfigured externally to change its default data format, effectively pulling the rug out from under code which relies on a certain (guessed) default.
Try interpreting the data as outlined above and see if the interpretation looks sensible.
P.S.
This might mean nothing at all, but I failed to find any example which has "e" between the "CURV" and the "?" in the calls to viQueryf().
Update (2013-01-17, in light of the new discoveries about the data format): to binary scan the data of varying types, you might employ two techniques:
binary scan accepts as many specifiers in a row as you like; they are processed from left to right as binary scan reads the supplied data.
You can do multiple runs of binary scanning over a chunk of your binary data either by cutting pieces off this chunk (string manipulation Tcl commands understand they're operating on a byte array and behave accordingly) or by using the @offset term in the binary scan format string to make it start scanning from the specified offset.
Another technique worth employing here is that you'd better first train yourself on a toy example. This is best done in an interactive Tcl shell — tkcon is a best bet but plain tclsh is also OK, especially if called via rlwrap (POSIX systems only).
For instance, you could create some fake data for yourself like this:
% set b [encoding convertto ascii ":CURVE #224"]
:CURVE #224
% append b [binary format S* [list 0 1 2 -3 4 -5 6 7 -8 9 10 -11]]
:CURVE #224............
Here we first created a byte array containing the header and then created another byte array containing twelve 16-bit integers packed MSB first, and then appended it to the first array, essentially creating a data block our device is supposed to return (well, there are fewer integers than the device returns). encoding convertto takes the name of a character encoding and a string and produces a binary array of that string converted to the specified encoding. binary format is told to consume a list of arbitrary size (* in the format list) and interpret it as a list of 16-bit integers to be packed in the big-endian format, which is what the S format character means.
Now we can scan it back like this:
% binary scan $b a8a curv nbytes
2
% encoding convertfrom ascii $curv
:CURVE #
% scan $nbytes %u n
1
% set n
2
% set headerlen [expr {$n + 9}]
11
% binary scan $b @9a$n nbytes
1
% scan $nbytes %u n
1
% set n
24
% set numints [expr {$n / 2}]
12
% binary scan $b @${headerlen}S${numints} data
1
% set data
0 1 2 -3 4 -5 6 7 -8 9 10 -11
Here we proceeded like this:
Interpret the header:
Read the first eight bytes of the data as ASCII characters (a8) — this should read our :CURVE # prefix. We convert the header prefix from the packed ASCII form to the Tcl's internal string encoding using encoding convertfrom.
Read the next byte (a) which is then interpreted as the length, in bytes, of the next field, using the scan command.
We then calculate the length of the header read so far to use it later. This value is saved to the "headerlen" variable. The length of the header amounts to the 9 fixed bytes plus a variable number of bytes (2 in our case) specifying the length of the following data.
Read the next field which will be interpreted as the "number of data bytes" value.
To do this, we offset the scanner by 9 (the length of ":CURVE #2") and read as many ASCII bytes as obtained in the previous step, so we use @9a$n for the format: $n is just obtaining the value of a variable named "n", and it will be 2 in our case. Then we scan the obtained value and finally get the number of bytes of raw data that follow.
Since we will read 16-bit integers, not bytes, we divide this number by 2 and store the result to the "numints" variable.
Read the data. To do this, we have to offset the scanner by the length of the header. We use @${headerlen}S${numints} for the format string. Tcl expands those ${varname} before passing the string to binary scan, so the actual string in our case will be @11S12, which means "offset by 11 bytes then scan 12 16-bit big-endian integers".
binary scan puts a list of integers to the variable which name is passed, so no additional decoding of those integers is needed.
Note that in the real program you should probably do certain sanity checks:
* After the first step check that the static part of the header is really ":CURVE #".
* Check the return value of binary scan and scan after each invocation and verify that it equals the number of variables passed to the command (which means the command was able to parse the data).
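If you want to cross-check the toy example outside Tcl, the same walk-through with those sanity checks built in might look like this in Python. It is a sketch under the same assumptions as the session above (a literal ":CURVE #" prefix and 16-bit big-endian samples); parse_curve is a hypothetical name.

```python
import re
import struct


def parse_curve(raw):
    """Parse ':CURVE #<x><yyy><data>' holding 16-bit big-endian integers."""
    match = re.match(rb":CURVE #(\d)", raw)
    if match is None:
        raise ValueError("static part of the header is not ':CURVE #'")
    x = int(match.group(1))
    length_field = raw[match.end():match.end() + x]
    if len(length_field) != x or not length_field.isdigit():
        raise ValueError("malformed length field")
    nbytes = int(length_field)
    start = match.end() + x
    payload = raw[start:start + nbytes]
    if len(payload) != nbytes or nbytes % 2:
        raise ValueError("payload is truncated or not a whole number of samples")
    return list(struct.unpack(">%dh" % (nbytes // 2), payload))


fake = b":CURVE #224" + struct.pack(">12h", 0, 1, 2, -3, 4, -5, 6, 7, -8, 9, 10, -11)
print(parse_curve(fake))
```

Each check raises instead of silently returning partial data, which plays the role of checking the return values of binary scan and scan.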
One more insight. The manual you cited says:
<yyy> is the number of bytes to transfer including checksum.
so it's quite possible that not all of those data bytes represent measures, but some of them represent the checksum. I don't know what format (and hence length) and algorithm and position of this checksum is. But if the data does indeed include a checksum, you can't interpret it all using S*. Instead, you will probably take another approach:
Extract the measurement data using string range and save it to a variable.
binary scan the checksum field.
Calculate the checksum on the data obtained on the first step, verify it.
Use binary scan on the extracted data to get back your measurements.
Checksumming procedures are available in tcllib.
# Download waveform
set RIBinaryWaveform [::VISA::Query ${VisaAlias} "CURVe?"]
# Extract block format data
set ResultCount [expr [string range ${RIBinaryWaveform} 2 [expr [string index ${RIBinaryWaveform} 1] + 1]] / 4]
# Parse out leading label from Tektronix block format
set RIBinaryWaveform [string range ${RIBinaryWaveform} [expr [string index ${RIBinaryWaveform} 1] + 2] end]
# Convert binary data to integer values
binary scan ${RIBinaryWaveform} "I${ResultCount}" Waveform
set Waveform
Okay, the code above does the magic trick. This is very similar to all the things discussed on this page, but I figured I needed to clear up the confusion about the numbers from the binary conversion being different from the numbers received in ASCII.
After troubleshooting with a Tektronix application specialist we discovered that the data I had been receiving after the binary conversion (the numbers that were off by a few digits) were actually the true values captured by the scope.
The ASCII values are wrong because of the binary-to-ASCII conversion done by the instrument; the scope then passes those incorrect values on to TCL.
So, we had it right a few days ago. The instrument was just throwing me for a loop.
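As a side note, the size of the discrepancy is consistent with the values passing through a 32-bit float somewhere in the ASCII path. That is only my reading of the numbers posted above, not something the Tektronix specialist is quoted as saying, but it is easy to check with a short Python sketch:

```python
import struct

binary_values = [
    -1064723993, -486674282, 50109321, -6337556, 70678, 8459972, 143470359,
    1046714383, 1082560884, 1042711231, 1074910212, 1057300801, 1061457453,
    1079313832, 1066305613, 1059935120, 1068139252, 1066053580, 1065228329,
    1062213553,
]
ascii_values = [
    -1064723968, -486674272, 50109320, -6337556, 70678, 8459972, 143470352,
    1046714368, 1082560896, 1042711232, 1074910208, 1057300800, 1061457472,
    1079313792, 1066305600, 1059935104, 1068139264, 1066053568, 1065228352,
    1062213568,
]


def through_float32(n):
    """Round an integer to the nearest value a 32-bit float can represent."""
    return int(struct.unpack(">f", struct.pack(">f", n))[0])


rounded = [through_float32(v) for v in binary_values]
print(rounded == ascii_values)
```

A 32-bit float keeps only 24 significant bits, so integers around 10^9 get rounded to multiples of 64 or 128, while the small values (70678, 8459972, -6337556) survive unchanged, just as in the two listings.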