Ideally I would want something like example.com/resources/äFg4вNгё5, minimum number of visible characters, never mind that they have to be percent encoded before transmitting them over HTTP.
Can you suggest a scheme that efficiently encodes 128-bit UUIDs into the fewest visible characters, without producing characters that break URLs?
Base-64 is good for this.
{098ef7bc-a96c-43a9-927a-912fc7471ba2}
could be encoded as
vPeOCWypqUOSepEvx0cbog
The usual equals signs at the end can be dropped: they only pad the string length to a multiple of 4, so they can be restored before decoding. And instead of + and /, you could use some URL-safe characters. You can pick two from: - . _ ~ (the URL-safe alphabet in RFC 4648 uses - and _).
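As a rough check of the example above, here is a minimal Python sketch. It assumes the sample string was produced from the .NET-style byte layout (first three GUID fields little-endian), which Python exposes as `bytes_le`; the function name `encode_guid` is made up for illustration.

```python
import base64
import uuid


def encode_guid(text: str) -> str:
    """Encode a GUID string as unpadded URL-safe base64.

    Assumption: the bytes are laid out the way .NET's Guid.ToByteArray()
    does it, which corresponds to uuid.UUID.bytes_le in Python.
    """
    raw = uuid.UUID(text).bytes_le            # 16 octets
    token = base64.urlsafe_b64encode(raw)     # 24 characters, ends in "=="
    return token.rstrip(b"=").decode("ascii")  # 22 characters


print(encode_guid("{098ef7bc-a96c-43a9-927a-912fc7471ba2}"))
```

With the big-endian `bytes` attribute instead of `bytes_le` the result is a different 22-character string, so both ends have to agree on the byte order.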
More information:
RFC 4648
Storing UUID as base64 String (Java)
guid to base64, for URL (C#)
Short GUID (C#)
I use a URL-safe base64 string. The following Python code does this*.
The last line removes the '=' or '==' that base64 encoding puts on the end. The padding makes it harder to put the string into a URL and is only needed for decoding, which does not need to be done here.
import base64
import uuid
# get a UUID - URL safe, Base64
def get_a_Uuid():
    r_uuid = base64.urlsafe_b64encode(uuid.uuid4().bytes)
    return r_uuid.replace('=', '')
The above does not work in Python 3, where urlsafe_b64encode returns bytes rather than str. This is what I'm doing instead:
    r_uuid = base64.urlsafe_b64encode(uuid.uuid4().bytes).decode("utf-8")
    return r_uuid.replace('=', '')
* This does follow the standards: base64.urlsafe_b64encode follows RFC 3548 and RFC 4648, see https://docs.python.org/2/library/base64.html. Stripping == from base64-encoded data of known length is allowed, see RFC 4648 §3.2. UUIDs/GUIDs are specified in RFC 4122; §4.1 Format states "The UUID format is 16 octets". The base64 function encodes these 16 octets.
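If the token ever does need to be turned back into a UUID, the stripped padding can be recomputed from the length. The following is a small sketch for Python 3; the helper names `uuid_to_token` and `token_to_uuid` are hypothetical and not part of any library.

```python
import base64
import uuid


def uuid_to_token(u: uuid.UUID) -> str:
    """16 octets -> 22 URL-safe characters, padding removed."""
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode("ascii")


def token_to_uuid(token: str) -> uuid.UUID:
    """Restore the '=' padding (length must be a multiple of 4), then decode."""
    padding = "=" * (-len(token) % 4)
    return uuid.UUID(bytes=base64.urlsafe_b64decode(token + padding))


original = uuid.uuid4()
token = uuid_to_token(original)
assert token_to_uuid(token) == original
```

Because a UUID is always 16 octets, the token is always 22 characters and the missing padding is always "==".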
I have this piece of html from wikipedia.
Judith Ehrlich
I understand "=3D" is Quoted-Printable encoding for "=", but I'm not sure what 3D"URL" means. Normally when I see a link in HTML it is written like this
Judith Ehrlich
In quoted-printable, any non-standard octets are represented as an = sign followed by two hex digits representing the octet's value. To represent a plain =, it needs to be represented using quoted-printable encoding too: 3D are the hex digits corresponding to ='s ASCII value (61).
In other words, the sequence of =3D"URL" in those fields is converted to just ="URL". 3D"URL" without = has no meaning on its own.
If used in a parsing/rendering situation that is interpreting = as a QP encoding, omitting 3D would result in the parser wrongly interpreting the next characters (e.g. "U) as a byte encoding. Using =3D would be necessary to insert an actual = in the parsed result.
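To see the decoding in action, here is a short sketch using Python's standard `quopri` module. The HTML fragment is a made-up stand-in, since the original markup is not shown in the question.

```python
import quopri

# Hypothetical quoted-printable fragment; the real markup from the
# question is not available, so this only illustrates the =3D rule.
encoded = b'<a href=3D"URL">Judith Ehrlich</a>'

# Each "=XX" sequence is replaced by the octet with hex value XX.
decoded = quopri.decodestring(encoded)
print(decoded)

# 3D is simply the hexadecimal form of 61, the ASCII code of "=".
print(format(ord("="), "02X"))
```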
I want a simple, light-weight way for two basic 8-bit MCUs to talk to each other over an 8-bit UART connection, sending both ASCII characters as 8-bit values, and binary data as 8-bit values.
I would rather not re-invent the wheel, so I'm wondering if some ASCII implementation would work, using ASCII control characters in some standard way.
The problem: either I'm not understanding it correctly, or it's not capable of doing what I want.
The Wikipedia page on control characters says a packet could be sent like this:
< DLE > < SOH > - data link escape and start of heading
Heading data
< DLE > < STX > - data link escape and start of text
< payload >
< DLE > < ETX > - data link escape and end of text
But what if the payload is binary data containing two consecutive bytes equivalent to DLE and ETX? How should those bytes be escaped?
The link may be broken and re-established, so a receiving MCU should be able to start receiving mid-packet, and have a simple way of telling when the next packet has begun, so it can ignore data until the end of that partial packet.
Error checking will happen at a higher level to ensure that a received packet is valid, unless ASCII standards can solve this too.
Since you are going to transfer binary data along with text messages, you do have to make sure the receiver won't confuse control bytes with payload contents. One way to do that is to encode the payload data so that none of the special characters appear in the output. If the overhead is not a problem, then a simple encoding like Base16 should be enough. Otherwise, you may want to take a look at escapeless encodings that have been specifically designed to remove certain characters from encoded data.
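A minimal sketch of the Base16 idea, written in Python for readability rather than as MCU code. The framing with a single STX and ETX byte is an assumption made for this example; the point is only that hex digits can never collide with control characters.

```python
import binascii

STX = b"\x02"  # start of text
ETX = b"\x03"  # end of text


def frame(payload: bytes) -> bytes:
    """Hex-encode the payload so it contains only 0-9 and A-F."""
    return STX + binascii.hexlify(payload).upper() + ETX


def unframe(packet: bytes) -> bytes:
    """Check the delimiters and decode the hex digits back to bytes."""
    if not (packet.startswith(STX) and packet.endswith(ETX)):
        raise ValueError("missing STX/ETX delimiters")
    return binascii.unhexlify(packet[1:-1])


# A payload that contains DLE (0x10) and ETX (0x03) survives intact.
packet = frame(b"\x10\x03")
assert unframe(packet) == b"\x10\x03"
```

The cost is that every payload byte becomes two bytes on the wire.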
I understand this is an old question but I thought I should suggest Serial Line Internet Protocol (SLIP) which is defined in RFC 1055. It is a very simple protocol.
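For illustration, here is a sketch of SLIP framing with the special byte values from RFC 1055 (END 0xC0, ESC 0xDB, ESC_END 0xDC, ESC_ESC 0xDD). It is written in Python to show the logic; the function names are hypothetical, and the rule of discarding everything before the first END is an assumption added here to cover the "start receiving mid-packet" requirement.

```python
END, ESC, ESC_END, ESC_ESC = 0xC0, 0xDB, 0xDC, 0xDD


def slip_encode(payload: bytes) -> bytes:
    """Wrap a payload in END bytes and escape END/ESC inside it."""
    out = bytearray([END])  # leading END flushes any line noise
    for b in payload:
        if b == END:
            out += bytes([ESC, ESC_END])
        elif b == ESC:
            out += bytes([ESC, ESC_ESC])
        else:
            out.append(b)
    out.append(END)
    return bytes(out)


def slip_decode_stream(stream: bytes) -> list:
    """Return the complete packets found in a byte stream.

    Bytes before the first END are dropped, so a receiver that joins
    mid-packet ignores the partial packet (assumption of this sketch).
    """
    packets, current = [], bytearray()
    synced = escaped = False
    for b in stream:
        if not synced:
            synced = (b == END)
            continue
        if escaped:
            # Unknown escape codes are passed through unchanged.
            current.append({ESC_END: END, ESC_ESC: ESC}.get(b, b))
            escaped = False
        elif b == ESC:
            escaped = True
        elif b == END:
            if current:  # empty packets between two ENDs are ignored
                packets.append(bytes(current))
                current.clear()
        else:
            current.append(b)
    return packets
```

SLIP itself has no error detection, so the higher-level check mentioned in the question is still needed.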
I have problems with reading text from an external XML.
Flash doesn't seem to have a problem with ASCII characters (32-127), but it isn't able to show extended characters (128-255).
In that XML I have, for example, „ (DEC: 132) and “ (DEC: 147).
In the XML those characters are not visible, but they are still there. Flash isn't able to show them. My approach was to get each charCode and convert it to a string, but that only works for printable characters.
var textToConvert:String = xml.parameters.text[1].value;
trace("LENGTH = "+textToConvert.length);
var test:String="";
for(var i:int=1;i<textToConvert.length;i++){
    trace(textToConvert.charCodeAt(i));
    //OCT
    trace(textToConvert.charCodeAt(i).toString(8));
    //HEX
    trace(textToConvert.charCodeAt(i).toString(16));
    //HEX
    test += textToConvert.charCodeAt(i).toString(16);
    trace("SYMBOL : " +String.fromCharCode(textToConvert.charCodeAt(i)))
}
trace("TEST: "+test);
Result:
76
114
4c
SYMBOL : L
132
204
84
SYMBOL : (Not Visible)
The next thing I tried was to prepend the escape sequence "\x" to each HEX value and then convert it to a String, but that doesn't work either:
s = "\x93\x93\x84\x93\x84";
ba.writeMultiByte(s,"ASCII");
trace(s);
This was my first approach (not working):
var byteArray:ByteArray = new ByteArray();
byteArray.writeMultiByte(textToConvert,"iso-8859-1");
trace("HIER: "+byteArray.readUTFBytes(byteArray.bytesAvailable));
What would be a universal approach to solve this problem?
This is the xml, it has hidden ascii characters (quotes). I want to parse the values of the nodes including those characters:
XML-DL
Internally AS3 strings are encoded as 16-bit Unicode. They support your characters. It has also decoded it correctly as it has read the correct char code.
Does the font used for output have a glyph capable of rendering it? This applies even to the AS3 console. Your char isn't "empty", it just can't draw it. If you changed your trace to include quotes either side of the character you would see it writes the empty space still.
If you dump it to a TextField instead using a font you know has the correct support then it should work as expected.
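A further possibility, which this thread does not confirm: in Unicode, code points 132 and 147 are C1 control characters, while the bytes 0x84 and 0x93 are „ and “ in Windows-1252. If the XML file was saved as Windows-1252 but read as ISO-8859-1, the quotes would arrive as invisible controls regardless of the font. The sketch below uses Python (not ActionScript) purely to show the difference at the byte level.

```python
import unicodedata

raw = bytes([0x84, 0x93])  # the two byte values reported in the question

as_cp1252 = raw.decode("cp1252")   # Windows-1252 reading: typographic quotes
as_latin1 = raw.decode("latin-1")  # ISO-8859-1 reading: U+0084 and U+0093

for label, text in (("cp1252", as_cp1252), ("latin-1", as_latin1)):
    for ch in text:
        # Category "Cc" means control character, i.e. nothing to draw.
        print(label, hex(ord(ch)), unicodedata.category(ch))
```

If that turns out to be the cause, declaring the correct encoding in the XML, or re-saving the file as UTF-8, would be the fix to try first.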
If this doesn't meet your needs then you will need to do some kind of conversion. There is no generally accepted library to do this, as it depends on your needs. What should be done with single chars that typically need multiple chars to represent them? ø is normally translated to 'oe', but that may not be suitable in a fixed-length string. There isn't an equivalent for most Hebrew, Cyrillic, Arabic etc. letters. What rules do you want to apply to those? You need to decide what you need and then do a conversion that matches those requirements (or pick a library that meets them).
I am trying to get the file contents using TFileStream:
procedure ShowFileCont(myfile : string);
var
  tr : string;
  fs : TFileStream;
begin
  fs := TFileStream.Create(myfile, fmOpenRead or fmShareDenyNone);
  SetLength(tr, fs.Size);
  fs.Read(tr[1], fs.Size);
  ShowMessage(tr);
  fs.Free;
end;
I made a small text file containing only:
aaaaaaaJ“њРЉTщЂ®8ЈЏVд"Ј¦AИaaaaaaa
I saved this file (using AkelPad) twice: once with the 1251 (ANSI) code page and once with the 65001 (UTF-8) code page.
The two files have different sizes but their contents look equal: I opened them both in Notepad and they both show the same text.
But when I run the ShowFileCont procedure, it shows me different results:
aaaaaaaJ?ЊT?8?V?"?A?aaaaaaa
aaaaaaaJ“њРЉTщЂ®8ЈЏVд"Ј¦AИaaaaaaa
Questions:
How do I get the real file contents using TFileStream?
How can these two files have different sizes when their content (in Notepad) is equal?
Add: Sorry, I didn't say that I use Lazarus/FPC and string = utf8string
Why do the files have different size?
Because they use different encodings. The 1251 encoding maps each character to a single byte. But UTF-8 uses variable numbers of bytes for each character.
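The size difference is easy to reproduce. The following Python sketch (used here only for illustration; the sample text is made up) encodes the same characters both ways and compares the byte counts.

```python
text = "aaaИдщaaa"  # 6 ASCII letters and 3 Cyrillic letters

as_1251 = text.encode("cp1251")  # one byte per character
as_utf8 = text.encode("utf-8")   # ASCII: 1 byte, Cyrillic: 2 bytes each

print(len(text), len(as_1251), len(as_utf8))

# Both byte sequences decode to the same text when the right
# encoding is used, which is why an editor shows them as equal.
assert as_1251.decode("cp1251") == as_utf8.decode("utf-8")
```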
How do I get the true file contents?
You need to use a string type that matches the encoding used in the file. So, for example, if the content is UTF-8 encoded, which is the best choice, then you load the content into a UTF-8 string. You are using FPC in a mode where string is UTF-8 encoded. In which case the code in the question is what you need.
Loading an MBCS encoded file with a code page of 1251, say, is more tricky. You can load that into an AnsiString variable and so long as your system's locale is 1251 then any conversions will be performed correctly.
But the code will behave differently when run on a machine with a different locale. And if you wanted to load text using different MBCS encodings, for example 1252, then you cannot use this approach. You would need to load into a byte array and then convert from 1252, say, to UTF-8 so that you could then store that UTF-8 in a string variable.
In order to do that you can use the LConvEncoding unit from LCL. For example, you can use CP1251ToUTF8, CP1252ToUTF8 etc. to convert from MBCS to UTF-8.
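The same load-bytes-then-convert idea, sketched in Python instead of Pascal to keep the examples in one language. The function name `to_utf8` is hypothetical; in Lazarus the corresponding step would be one of the LConvEncoding calls named above.

```python
def to_utf8(raw: bytes, codepage: str) -> bytes:
    """Interpret raw bytes in the given code page and re-encode as UTF-8."""
    return raw.decode(codepage).encode("utf-8")


raw = b"\xc8"  # a single byte read from a file

# The same byte means different characters in different code pages,
# so the caller has to say which one applies.
print(to_utf8(raw, "cp1251").decode("utf-8"))  # Cyrillic letter
print(to_utf8(raw, "cp1252").decode("utf-8"))  # Latin letter with grave accent
```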
How can I determine from the file what encoding is used?
You cannot. You can make a guess that will be accurate in many cases. But in general, it is simply impossible to identify the encoding of an array of bytes that is meant to represent text.
It is sometimes possible to take a file and rule out certain encodings. For example, not all byte streams are valid UTF-8 or UTF-16 text, so files that fail validation cannot be in those encodings. But for encodings like 1251, 1252 etc., any byte stream is valid. There's simply no way for you to tell 1251 encoded streams apart from 1252 encoded streams with 100% accuracy.
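Ruling out UTF-8 can be sketched as a strict decode attempt. This is an illustration in Python with a made-up helper name, not a general encoding detector.

```python
def could_be_utf8(raw: bytes) -> bool:
    """True if the bytes form valid UTF-8; False rules UTF-8 out."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


utf8_bytes = "aИa".encode("utf-8")
cp1251_bytes = "aИa".encode("cp1251")  # 0x61 0xC8 0x61

print(could_be_utf8(utf8_bytes))    # valid UTF-8
print(could_be_utf8(cp1251_bytes))  # 0xC8 lacks a continuation byte

# The 1251 bytes also decode without error as 1252, just to other
# characters, so a decode attempt cannot tell those two apart.
print(cp1251_bytes.decode("cp1252"))
```

Note that passing the check does not prove the file is UTF-8; pure ASCII, for example, is valid in all of these encodings.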
The LConvEncoding unit has GuessEncoding which sounds like it may be of some use.
Their contents are obviously not equal. You can see for yourself that the file sizes are different. Things of different size are never equal.
Your files might appear equal in Notepad because Notepad knows how to recognize certain character encodings. You saved your file two different ways. One way used an encoding that assigns one byte to each of 256 possible values. The other way uses an encoding that assigns between one and four bytes to each of more than a million possible code points. Some of the characters you saved require more than one byte, which explains why one version of the file is bigger than the other.
TFileStream doesn't pay attention to any of that. It just deals with bytes. Depending on your Delphi version, your string variable may or may not pay attention to encodings. Prior to Delphi 2009, string stored one byte per character. As of Delphi 2009, string uses two bytes per character, so your SetLength call is wrong, and everything after that is pointless to investigate much further.
With one byte per character, your ShowMessage call is not going to interpret the string as UTF-8-encoded. Instead, it will interpret your string using whatever your system code page is. If you know that the string you've read is encoded with UTF-8, then you'll want to convert it to UTF-16 prior to display by calling UTF8Decode. That will return a WideString, and you can use any number of functions to display it, such as MessageBoxW. If you have Delphi 2009 or later, then the compiler will insert conversion code for you automatically, if you've used Utf8String instead of string.
I've been reading up on the RFC-4627 specification, and I've come to the following interpretation:
When advertising a payload as application/json mime-type,
there MUST be no BOMs at the beginning of properly encoded JSON streams (based on section "3. Encoding"), and
no media parameters are supported, thus a mime-type header of application/json; charset=utf-8 does not conform to RFC-4627 (based on section "6. IANA Considerations").
Are these correct deductions? Will I run into problems when implementing web services or web clients which adhere to this interpretation? Should I file bugs against web browsers which violate the two properties above?
You are right
The BOM character is illegal in JSON (and not needed)
The MIME charset is illegal in JSON (and not needed as well)
RFC 7159, Section 8.1:
Implementations MUST NOT add a byte order mark to the beginning of a JSON text.
This is put as clearly as it can be. This is the only "MUST NOT" in the entire RFC.
RFC 7159, Section 11:
The MIME media type for JSON text is application/json.
Type name: application
Subtype name: json
Required parameters: n/a
Optional parameters: n/a
[...]
Note: No "charset" parameter is defined for this registration.
JSON encoding
The only valid encodings of JSON are UTF-8, UTF-16 and UTF-32. Since the first character (or the first two, if there is more than one character) will always have a Unicode value lower than 128 (no valid JSON text can have higher values in its first two characters), it is always possible to know which of the valid encodings and which endianness was used just by looking at the byte stream.
RFC recommendation
The JSON RFC says that the first two characters will always be below 128 and you should check the first 4 bytes.
I would put it differently: since a string "1" is also valid JSON there is no guarantee that you have two characters at all - let alone 4 bytes.
My recommendation
My recommendation of determining the JSON encoding would be slightly different:
Fast method:
if you have 1 byte and it's not NUL - it's UTF-8
(actually the only valid character here would be an ASCII digit)
if you have 2 bytes and none of them are NUL - it's UTF-8
(those must be ASCII digits with no leading '0', {}, [] or "")
if you have 2 bytes and only the first is NUL - it's UTF-16BE
(it must be an ASCII digit encoded as UTF-16, big endian)
if you have 2 bytes and only the second is NUL - it's UTF-16LE
(it must be an ASCII digit encoded as UTF-16, little endian)
if you have 3 bytes and they are not NUL - it's UTF-8
(again, ASCII digits with no leading '0's, "x", [1] etc.)
if you have 4 bytes or more then the RFC method works:
00 00 00 xx - it's UTF-32BE
00 xx 00 xx - it's UTF-16BE
xx 00 00 00 - it's UTF-32LE
xx 00 xx 00 - it's UTF-16LE
xx xx xx xx - it's UTF-8
but it only works if it is indeed a valid string in any of those encodings, which it may not be. Moreover, even if you have a valid string in one of the 5 valid encodings, it may still not be a valid JSON.
My recommendation would be to have a slightly more rigid verification than the one included in the RFC to verify that you have:
a valid encoding of either UTF-8, UTF-16 or UTF-32 (LE or BE)
a valid JSON
Looking only for NUL bytes is not enough.
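The rules above can be sketched roughly as follows. This is one possible reading of the list, not a reference implementation; the names `guess_json_encoding` and `parse_json_bytes` are hypothetical. The second function adds the stricter verification by actually decoding and parsing.

```python
import json

# True marks a NUL byte among the first four bytes of the stream.
_NUL_PATTERNS = {
    (True, True, True, False): "utf-32-be",    # 00 00 00 xx
    (True, False, True, False): "utf-16-be",   # 00 xx 00 xx
    (False, True, True, True): "utf-32-le",    # xx 00 00 00
    (False, True, False, True): "utf-16-le",   # xx 00 xx 00
    (False, False, False, False): "utf-8",     # xx xx xx xx
}


def guess_json_encoding(data: bytes) -> str:
    """Guess the encoding from the NUL pattern of the leading bytes."""
    nul = tuple(b == 0 for b in data[:4])
    if len(data) >= 4:
        if nul not in _NUL_PATTERNS:
            raise ValueError("NUL pattern matches no JSON encoding")
        return _NUL_PATTERNS[nul]
    if data and not any(nul):
        return "utf-8"              # 1 to 3 bytes, none of them NUL
    if nul == (True, False):
        return "utf-16-be"          # a single character, big endian
    if nul == (False, True):
        return "utf-16-le"          # a single character, little endian
    raise ValueError("too short or malformed to be JSON")


def parse_json_bytes(data: bytes):
    """Verify both the encoding and the JSON by decoding and parsing."""
    text = data.decode(guess_json_encoding(data))  # UnicodeDecodeError if invalid
    return json.loads(text)                        # JSONDecodeError if invalid
```

Both error types raised in `parse_json_bytes` are subclasses of `ValueError`, so a caller can reject bad input with a single `except ValueError`.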
That having been said, at no point do you need any BOM characters to determine the encoding, nor do you need a MIME charset. Both are unnecessary and not valid in JSON.
You only have to use the binary content-transfer-encoding when using UTF-16 and UTF-32 because those may contain NUL bytes. UTF-8 doesn't have that problem and 8bit content-transfer-encoding is fine as it doesn't contain NUL in the string (though it still contains bytes >= 128 so 7-bit transfer will not work - there is UTF-7 that would work for such a transfer but it wouldn't be valid JSON, as it is not one of the only valid JSON encodings).
See also this answer for more details.
Answering your followup questions
Are these correct deductions?
Yes.
Will I run into problems when implementing web services or web clients which adhere to this interpretation?
Possibly, if you interact with incorrect implementations. Your implementation MAY ignore the BOM for the sake of interoperability with incorrect implementations - see RFC 7159, Section 8.1:
In the interests of interoperability, implementations
that parse JSON texts MAY ignore the presence of a byte order mark
rather than treating it as an error.
Also, ignoring the MIME charset is the expected behavior of compliant JSON implementations - see RFC 7159, Section 11:
Note: No "charset" parameter is defined for this registration.
Adding one really has no effect on compliant recipients.
Security considerations
I am not personally convinced that silently accepting incorrect JSON streams is always desired. If you decide to accept input with a BOM and/or MIME charset then you will have to answer these questions:
What to do in case of a mismatch between MIME charset and actual encoding?
What to do in case of a mismatch between BOM and MIME charset?
What to do in case of a mismatch between BOM and the actual encoding?
What to do when all of them differ?
What to do with encodings other than UTF-8/16/32?
Are you sure that all security checks will work as expected?
Having the encoding defined in three independent places - in a JSON string itself, in the BOM and in the MIME charset makes the question inevitable: what to do if they disagree. And unless you reject such an input then there is no one obvious answer.
For example, if you have code that verifies the JSON string to see if it's safe to eval it in JavaScript, it might be misled by the MIME charset or the BOM, treat the input as a different encoding than it actually is, and fail to detect strings that it would detect if it used the correct encoding. (A similar problem with HTML has led to XSS attacks in the past.)
You have to be prepared for all of those possibilities whenever you decide to accept incorrect JSON strings with multiple and possibly conflicting encoding indicators. It's not to say that you should never do that because you may need to consume input generated by incorrect implementations. I'm just saying that you need to thoroughly consider the implications.
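One way to avoid the conflicting-indicator questions is to reject such input outright. The sketch below is a hypothetical helper, not a standard API: it assumes UTF-8 bodies only, treats any media type parameter as an error, and offers an opt-in `ignore_bom` flag for the MAY clause quoted above.

```python
import codecs
import json

_BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE,
    codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE,
)


def load_json_strict(body: bytes,
                     content_type: str = "application/json",
                     ignore_bom: bool = False):
    """Parse a UTF-8 JSON body, rejecting a BOM and any MIME parameters."""
    media_type, _, params = content_type.partition(";")
    if media_type.strip().lower() != "application/json":
        raise ValueError("not application/json")
    if params.strip():
        # No charset (or any other parameter) is defined for this type.
        raise ValueError("unexpected parameter: " + params.strip())
    if body.startswith(_BOMS):
        if not (ignore_bom and body.startswith(codecs.BOM_UTF8)):
            raise ValueError("byte order mark at start of JSON text")
        body = body[len(codecs.BOM_UTF8):]  # tolerated only on request
    return json.loads(body.decode("utf-8"))
```

Rejecting by default keeps a single source of truth for the encoding, at the cost of failing on input from the incorrect implementations mentioned earlier.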
Nonconforming implementations
Should I file bugs against web browsers which violate the two properties above?
Certainly - if they call it JSON and the implementation doesn't conform to the JSON RFC then it is a bug and should be reported as such.
Have you found any specific implementations that don't conform to the JSON specification and yet claim to do so?
I think you are correct about question 1, due to Section 3 about the first two characters being ASCII and the Unicode FAQ on BOMs, see "Q: How I should deal with BOMs?", answer part 3. Your emphasis on MUST may be a bit strong: the FAQ seems to imply SHOULD.
Don't know the answer to question 2.