How do I discover the encoding of a JSON message?

JSON's official specification says:
JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and...
So, essentially, the JSON message can come in any of those three encodings. But... how do I guess which one it is when I receive it?
The message can come from multiple sources: a queue, the browser, the database, the file system, etc.
It also says to ignore Byte Order Marks (BOM):
...implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
I remember XML docs had a "prolog" that specified the encoding, but I can't find anything similar for JSON messages.
Any ideas?

rsp and CouchDeveloper have covered this pretty well with their answers (I can't take credit for those).
Both answers look at the byte patterns to determine what encoding has been used. Apologies this doesn't directly answer your question, but it may help you to write an implementation of your own.
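For reference, here is a minimal sketch (in Python, not taken from either of those answers) of the byte-pattern heuristic described in RFC 4627: the first two characters of a JSON text are ASCII, so the positions of zero bytes in the first four octets reveal the UTF family and endianness. Note that the current RFC 8259 requires UTF-8 for JSON exchanged between systems, so in practice UTF-8 is by far the most common case.

```python
import json

def sniff_json_encoding(data: bytes) -> str:
    """Guess the Unicode encoding of a JSON byte stream from its first four
    octets, using the null-byte heuristic from RFC 4627."""
    # An explicit BOM, if present, settles it (strip it before parsing).
    boms = [(b"\x00\x00\xfe\xff", "utf-32-be"), (b"\xff\xfe\x00\x00", "utf-32-le"),
            (b"\xfe\xff", "utf-16-be"), (b"\xff\xfe", "utf-16-le"),
            (b"\xef\xbb\xbf", "utf-8")]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    head = data[:4]
    if len(head) == 4:
        if head[0] == 0 and head[1] == 0 and head[2] == 0:
            return "utf-32-be"          # 00 00 00 xx
        if head[0] == 0 and head[2] == 0:
            return "utf-16-be"          # 00 xx 00 xx
        if head[1] == 0 and head[2] == 0 and head[3] == 0:
            return "utf-32-le"          # xx 00 00 00
        if head[1] == 0 and head[3] == 0:
            return "utf-16-le"          # xx 00 xx 00
    return "utf-8"                      # xx xx xx xx (or too short to tell)

raw = '{"msg": "hello"}'.encode("utf-16-le")
enc = sniff_json_encoding(raw)
print(enc)                              # utf-16-le
print(json.loads(raw.decode(enc)))      # {'msg': 'hello'}
```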

As per my understanding, the producer/sender of the JSON data should specify the encoding used, rather than leaving the receiver to guess it. Usually this information is part of the API documentation that the producer/sender provides to the receiver.
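For example, if the data arrives over HTTP, the sender can declare the encoding in the Content-Type header. A minimal sketch of reading that declaration on the receiving side, assuming the sender actually sets a charset parameter (older JSON RFCs allowed one):

```python
from email.message import Message

def charset_from_content_type(content_type: str, default: str = "utf-8") -> str:
    """Return the charset parameter of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_param("charset", failobj=default)

print(charset_from_content_type("application/json; charset=utf-16"))  # utf-16
print(charset_from_content_type("application/json"))                  # utf-8 (default)
```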

Related

Is some format mandatory for the plaintext/payload of JOSE JWE/JWS packets?

I want to transfer and share some data inside JOSE JWE/JWS packets between different endpoints running differing OSes/libraries. Therefore I want to adhere to the relevant standards (RFCs) as closely as possible, for interoperability. Sadly, I did not find an answer while reading these texts (maybe I missed something?).
Some of my payloads are naturally JSON while others are not. I think it would be dumb to convert the others into a JSON wrapper with just one entry, if only one entry is possible anyway.
I noticed, though, that some libraries only allow some form of dictionary when encoding data into JWE/JWS, while others will accept any string. So I am wondering whether it would be considered bad practice to encode data plainly into these formats.
I would like to design my protocols to be future-proof, which is why I am keen to do things the right way when working with encryption/encoding.
Only JWTs in JWE or JWS format need to be a top-level JSON object. There is no requirement on the payload/plaintext format or content in pure JWS and JWE.
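To make that concrete, here is a minimal sketch (standard library only; the key and header values are illustrative) showing that the JWS payload in RFC 7515 is just a sequence of octets, so any bytes can be carried, not only JSON:

```python
import base64, hashlib, hmac, json

def b64url(data: bytes) -> bytes:
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def jws_compact_hs256(payload: bytes, key: bytes) -> str:
    """Sign an arbitrary octet payload as a JWS compact serialization."""
    protected = b64url(json.dumps({"alg": "HS256"}).encode())
    signing_input = protected + b"." + b64url(payload)
    signature = b64url(hmac.new(key, signing_input, hashlib.sha256).digest())
    return (signing_input + b"." + signature).decode()

# A non-JSON payload: plain text, but any bytes would do.
print(jws_compact_hs256(b"just some plain text", key=b"demo-secret"))
```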

Decoding parameters in Google webapps

I'm trying to better understand Google web applications. The HTML source has JSON that has been encoded in some unknown way, which I would like to decode. For example, the source below contains parameters such as DpimGf and EP1ykd, which make no sense:
view-source:https://contacts.google.com/
..window.WIZ_global_data = {"DpimGf":false,"EP1ykd":.....
So far I have tried the following:
1. Decoding with a base64 decoder, but the results are unprintable/not usable (sketched below).
2. Decoding with the polyline encoding used in Google Maps.
3. Building an app from scratch to perform base64 -> binary -> XOR -> ASCII char, shifting the binary values by up to 7 places [inspired by the polyline algorithm].
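A minimal sketch of the kind of check described in attempt 1 (purely illustrative; DpimGf is just one of the parameter names from the page source):

```python
import base64

def try_base64(token: str):
    """Try to base64-decode a token and return it only if it is readable text."""
    padded = token + "=" * (-len(token) % 4)   # base64 input length must be a multiple of 4
    try:
        raw = base64.urlsafe_b64decode(padded)
        text = raw.decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return None
    return text if text.isprintable() else None

print(try_base64("DpimGf"))   # None: the decoded bytes are not readable text
```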
Is there any documentation from Google, or a known encoding for such formats?
Assumptions: I'm pretty sure this is an encoding of some sort and not encryption, because:
1) The length of the text doesn't match the output of common encryption algorithms.
2) There is some sort of pattern to the values of the parameters, so there's a good chance they are just encoded without any encryption, since common encryption produces completely different strings each time.
3) There is a good chance they may not decode at all, because they might map server-side to meaningful parameter names.
Thanks
Try using this: http://ddecode.com/hexdecoder/?results=cddb95fa500e7c1af427d005985260a7. Try running the whole thing through it; it might help.

Is there a standard to specify a binary format in JSON?

I would like to know whether there is some standard that specifies binary formats using JSON as the describing language, similar to Google's Protocol Buffers.
Protocol Buffers seem very powerful, but they require parsing yet another language and add considerable overhead, especially for compiled languages such as C++.
So I am wondering whether there is some accepted standard that uses JSON to describe a binary format. (Parsing the binary data might then still require some manual steps, but at least a clear and unique description of the data can be made available.)
To be clear, I am not talking about encoding binary data in JSON, I am talking about describing binary data in JSON.
Head to the Wikipedia comparison of data serialization formats and evaluate for yourself. I don't know what the right argument is to overcome your programmer's inertia. I'd consider Apache Avro the best fit for your requirement: its schemas are written in JSON.
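To make the Avro suggestion concrete: the schema that describes the binary layout is itself a JSON document, and records are encoded in a compact binary form against it. A minimal sketch, assuming the third-party fastavro package is installed; the schema here is purely illustrative:

```python
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

# The binary layout is *described* in JSON: this dict is the Avro schema.
schema = parse_schema({
    "type": "record",
    "name": "SensorReading",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "celsius", "type": "float"},
    ],
})

buf = io.BytesIO()
schemaless_writer(buf, schema, {"id": 42, "celsius": 21.5})  # compact binary, no field names on the wire
buf.seek(0)
print(schemaless_reader(buf, schema))                        # {'id': 42, 'celsius': 21.5}
```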
For the least friction, you could try MessagePack or BSON, which are essentially JSON in a better-packed binary form. But, having no external declaration, they need to be self-describing, so the field names must be transported on the wire; they are therefore not as "binary" and compact as Protocol Buffers or Avro.
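For comparison, a short sketch with the msgpack package (assumed to be installed) showing that trade-off: the encoding is compact and schema-free, but the key names travel inside every message:

```python
import msgpack

record = {"id": 42, "celsius": 21.5}
packed = msgpack.packb(record)
print(len(packed), packed)        # the literal keys "id" and "celsius" are in the bytes
print(msgpack.unpackb(packed))    # {'id': 42, 'celsius': 21.5}
```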

How to automatically generate ASN.1-encoded packets?

I want to test my application, and I need to generate different loads. The application is a SUPL RRLP protocol parser, and I have the ASN.1 specification for this protocol. Packets have a lot of optional fields, and the number of variants may be over a billion; I can't go through all the options manually, so I want to automate it.
One way is to generate the packets automatically; the other is to create many different sets of value assignments and encode each into the binary format.
I found some tools, for example libtasn and Asn1Editor, but the first one can't parse my existing ASN.1 spec file and the second one can't encode packets from a specification.
I'm afraid of writing yet another ASN.1 parser myself, because I could introduce errors into the test process.
I hoped it would be easy to find something existing, but... I'm giving up.
Maybe someone on Stack Overflow has faced the same problem and found a solution, or knows something to recommend? Thank you.
Please try going to https://asn1.io/asn1playground/ and try your specification there. You can ask it to generate a sample value for a given ASN.1 type. You can encode it and edit either the encoded (hex) data, or decoded values to create additional values.
You can also download a free trial of the OSS ASN.1 Tools from http://www.oss.com/asn1/products/asn1-download.html which includes OSS ASN.1 Studio. This also allows you to generate (and modify) sample values for a given ASN.1 type.
Note that these don't generate thousands of different test values for you automatically, but will parse valid value notation and encode the values for you if you are able to generate valid ASN.1 value notation.
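If hand-translating just the message types you need is acceptable, here is a sketch of the randomized-value approach using pyasn1. Two caveats: pyasn1 only supports BER/CER/DER, not the PER encoding RRLP normally uses, and the TestMessage type below is purely illustrative, not taken from the SUPL RRLP spec:

```python
import random
from pyasn1.type import namedtype, univ
from pyasn1.codec.ber import encoder

class TestMessage(univ.Sequence):
    """Illustrative ASN.1 SEQUENCE with two OPTIONAL fields."""
    componentType = namedtype.NamedTypes(
        namedtype.NamedType('referenceNumber', univ.Integer()),
        namedtype.OptionalNamedType('extraFlag', univ.Boolean()),
        namedtype.OptionalNamedType('comment', univ.OctetString()),
    )

def random_message() -> bytes:
    """Build one randomized value, including each OPTIONAL field with p = 0.5."""
    msg = TestMessage()
    msg.setComponentByName('referenceNumber', random.randint(0, 255))
    if random.random() < 0.5:
        msg.setComponentByName('extraFlag', random.choice([True, False]))
    if random.random() < 0.5:
        msg.setComponentByName('comment', b'load-test')
    return encoder.encode(msg)

packets = [random_message() for _ in range(1000)]
print(len(packets), packets[0].hex())
```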

How to read the encoding header without knowing the encoding?

If I am reading an XML or HTML file, don't I have to read the tag that tells me the encoding to be able to read the file? Isn't that tag encoded the same way the file is? I am curious how you read that tag without knowing the encoding. I realize this is a solved problem; I am just curious how it's done.
Update 1
I don't get it: in UTF-16, won't each character take 2 bytes, not one, and be different from ASCII? For example, the character E (U+0045) in UTF-16 shows up as 0xFEFF 0x0045, that is, the BOM 0xFEFF and then 0x0045, but some encodings swap the byte order of that. Do you have to figure it out by checking for 0xFEFF and realizing that it can't be ASCII, or something?
Here's what W3C has to say about it:
The XML encoding declaration functions as an internal label on each entity, indicating which character encoding is in use. Before an XML processor can read the internal label, however, it apparently has to know what character encoding is in use--which is what the internal label is trying to indicate. In the general case, this is a hopeless situation. It is not entirely hopeless in XML, however, because XML limits the general case in two ways: each implementation is assumed to support only a finite set of character encodings, and the XML encoding declaration is restricted in position and content in order to make it feasible to autodetect the character encoding in use in each entity in normal cases.
http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing
The encoding name is limited to ([A-Za-z0-9._] | '-'), so the declaration's bytes read the same in any encoding based on ASCII or ISO 646 (e.g. ISO 8859-*, ISO 10646/Unicode).
Edit: There are still some ambiguities, though. For example, you still need to have some idea of whether to attempt to read 8-, 16-, or 32-bit chunks at a time. There's also the detail that a proper UTF-16 or UTF-32/UCS-4 file should start with a BOM (which the XML spec does in fact require for UTF-16-encoded entities).
If, however, you know the file is supposed to contain XML, you have a pretty good idea of how the file needs to start, so an incorrect guess is easy to detect.
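Here is a minimal sketch of that autodetection, following the procedure outlined in XML 1.0, Appendix F: check for a BOM or for the byte pattern of '<?xml' in each encoding family, and only then read the declaration itself, which names the exact encoding:

```python
import re

_SIGNATURES = [
    (b"\x00\x00\xfe\xff", "utf-32-be"), (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xfe\xff", "utf-16-be"), (b"\xff\xfe", "utf-16-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    # No BOM: how '<?xml' looks in each encoding family.
    (b"\x00\x00\x00<", "utf-32-be"), (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"), (b"<\x00?\x00", "utf-16-le"),
    (b"<?xml", "ascii-compatible"),
]

def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding family from the first bytes, then read the
    encoding declaration for the exact name where possible."""
    family = "utf-8"   # the spec's fallback when nothing matches
    for sig, name in _SIGNATURES:
        if data.startswith(sig):
            family = name
            break
    if family == "ascii-compatible":
        # Enough to read the declaration, which names the exact encoding.
        m = re.match(rb"<\?xml[^>]*encoding=['\"]([A-Za-z0-9._-]+)['\"]", data)
        family = m.group(1).decode("ascii") if m else "utf-8"
    return family

print(sniff_xml_encoding('<?xml version="1.0" encoding="ISO-8859-1"?><a/>'.encode("latin-1")))
```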
For HTML, it is documented in HTML5. (Don't read if you still believe anything is sane on the web, though.)