What charset to use for json with base64 encoded binary data? - json

What is the most space efficient charset for JSON (UTF-8/16/32) for use of base64 encoded binary data?
{ data: "jA0EAwMCxamDRMfOGV5gyZPnyX1BB" }

Base64 is ASCII, so if the bulk of your JSON is Base64-encoded data, the most space-efficient encoding will be UTF-8. UTF-8 encodes ASCII characters (code points 0000–007F) as one byte, whereas UTF-16 and UTF-32 encode them as two and four, respectively.
Furthermore, it's just a good idea to use UTF-8, because it's the default encoding for JSON and not all tools support other encodings. From RFC-7159:
8.1 Character Encoding
JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32).

Related

Control characters in JSON string

The JSON specification states that control characters that must be escaped are only with codes from U+0000 to U+001F:
7. Strings
The representation of strings is similar to conventions used in the C
family of programming languages. A string begins and ends with
quotation marks. All Unicode characters may be placed within the
quotation marks, except for the characters that must be escaped:
quotation mark, reverse solidus, and the control characters (U+0000
through U+001F).
Main idea of escaping is to don't damage output when printing JSON document or message on terminal or paper.
But there other control characters like [DEL] from C0 and other control characters from C1 set (U+0080 through U+009F). Shouldn't be they also escaped in JSON strings?
From the JSON specification:
8. String and Character Issues
8.1. Character Encoding
JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32.
In UTF-8, all codepoints above 127 are encoded in multiple bytes. About half of those bytes are in the C1 control character range. So in order to avoid having those bytes in a UTF-8 encoded JSON string, all of those code points would need to be escaped. This effectively eliminates the use of UTF-8 and the JSON string might as well be encoded in ASCII. As ASCII is a subset of UTF-8 this is not disallowed by the standard. So if you are concerned with putting C1 control characters in the byte stream just escape them, but requiring every JSON representation to use ASCII would be wildly inefficient in anything but an english environment.
UTF-16 and UTF-32 could not possibly be parsed by something that uses the C1 (or even C0) control characters so the point is rather moot for those encodings.

How to decode base64 unicode string using T-SQL

Can't decode turkish characters in base64 string.
Base64 string = "xJ/DvGnFn8Onw7bDlsOHxLDEnsOcw5w="
When I decode it must be like this : 'ğüişçöÖÇİĞÜÜ'
I try to decode like this :
SELECT CAST(
CAST(N'' AS XML).value('xs:base64Binary("xJ/DvGnFn8Onw7bDlsOHxLDEnsOcw5w=")' , 'VARBINARY(MAX)')
AS NVARCHAR(MAX)
) UnicodeEncoding ;
Based on this answer : Base64 encoding in SQL Server 2005 T-SQL
But have response like this : '鿄볃앩쎟쎧쎶쎖쒇쒰쎞쎜'
Base64 string is correct because when I try decode in Base64decode.org it works.
Is there any way to decode turkish characters?
Your base-64 encoded data contains an UTF-8 string. MS SQL doesn't support UTF-8, only UTF-16, so it fails for any characters outside of ASCII.
The solution is to either send the data as nvarchar right away, or to encode the string as UTF-16 (and send it as varbinary or base-64, as needed).
Based on Erlang documentation, this might require an external library, unicode: http://www.erlang.org/doc/apps/stdlib/unicode_usage.html
Basically, the default seems to be UTF-8, you need to specify UTF-16 manually. UTF-16 support seems a bit clunky, but it should be quite doable.

confused by html5, utf-8 and 8859-1

Yesterday I upgraded an html page from "4.01 strict" to html5.
* http://r0k.us/rock/games/CoH/HallsOfHeroes/
The character encoding is iso-8859-1. The http://validator.w3.org fails and won't even parse it when utf-8 is specified as charset, apparently because I use footnote characters such as ² . They are in the upper 128 bytes of the character set. What confuses me is that I keep reading that the first 256 bytes of utf-8 is 8859-1.
Does anyone know why the page won't validate as utf-8 ?
Actually, only the first 128 code points are encoded in UTF-8 as ASCII, but UTF-8 is not ASCII, in particular, the next 128 code points differ.
You need to re-save the files as UTF-8 if you want them to be served as UTF-8.
The character ² ("SUPERSCRIPT TWO") is represented by the number 0xb2 (178 decimal) -- but it's represented differently in 8859-1 and UTF-8.
In 8859-1, it's represented as a single byte with the value 0xb2.
In UTF-8, it's represented as two consecutive bytes with the values 0xc2, 0xb2. See here for an explanation of the encoding.
(8859-1 is more compact that UTF-8 for files containing 8-bit characters, but it's incapable of representing anything past 255. UTF-8 is compatible with ASCII and with 8859-1 for 7-bit characters, is reasonably compact for most text, and can represent more than a million distinct characters.)
A file containing only 7-bit characters can be interpreted either as ASCII, 8859-1, or UTF-8. A file containing 8-bit characters cannot; it has to be translated.
If you're on a Unix-like system with the iconv command installed, this:
iconv -f iso-8859-1 -t utf-8
will perform the appropriate translation.

2 encodings between an HTML representation

Im reading one chapter from the W3C HTML Document Representation
In the 5.1 says this:
User agents must also know the specific character encoding that was used to transform the document character stream into a byte stream.
Then in the 5.2 says this:
The "charset" parameter identifies a character encoding, which is a method of converting a sequence of bytes into a sequence of characters.
Char-Bytes
Bytes-Char
So im wrong or there are 2 encodings between the representation...
A "character encoding" such as UTF-8 is, strictly speaking, a specification for representing characters as a sequence of bytes. But the encodings are always reversible, so we can speak of a (single) character encoding as going both ways.
Other character encodings used in practice are UTF-16 ad UTF-32.
Each of these are specifications under which you can encode text as bytes and decode bytes into characters. Two parts of the same specification.

How to distinguish UTF-8 and ASCII files?

How to distinguish UTF-8 (no BOM) and ASCII files?
If the file contains any bytes with the top bit set, then it is not ASCII.
So if the only possibilities are ASCII or UTF-8, then it's UTF-8.
If the file contains only bytes with the top bit clear, then it's meaningless to distinguish whether it's ASCII or UTF-8, since it represents exactly the same series of characters either way. But you can call it ASCII.
Of course this doesn't distinguish UTF-8 from ISO Latin or CP1252, and neither does it confirm that the so-called UTF-8 is actually valid.
http://msdn.microsoft.com/en-us/library/dd318672%28v=vs.85%29.aspx
IsTextUnicode Function
Determines if a buffer is likely to contain a form of Unicode text.