I have this piece of HTML from Wikipedia.
Judith Ehrlich
I understand "=3D" is Quoted-Printable encoding for "=", but I'm not sure what 3D"URL" means. Normally when I see a link in HTML it is written like this:
Judith Ehrlich
In quoted-printable, any non-standard octet is represented as an = sign followed by two hex digits giving the octet's value. A plain = therefore has to be encoded too: 3D is the hexadecimal form of ='s ASCII value (61 decimal).
In other words, the sequence of =3D"URL" in those fields is converted to just ="URL". 3D"URL" without = has no meaning on its own.
If used in a parsing/rendering situation that is interpreting = as a QP encoding, omitting 3D would result in the parser wrongly interpreting the next characters (e.g. "U) as a byte encoding. Using =3D would be necessary to insert an actual = in the parsed result.
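As a minimal sketch of the decoding described above, Python's standard quopri module can be used. The anchor tag and the example.org URL are made-up placeholders, not the markup from the question.

```python
import quopri

# Hypothetical sample: an anchor tag as it might appear in a quoted-printable body.
encoded = b'<a href=3D"https://example.org/page" title=3D"Example">Example</a>'

# Decoding turns every "=3D" back into a plain "=".
decoded = quopri.decodestring(encoded)
print(decoded.decode("ascii"))

# Encoding goes the other way: every literal "=" has to become "=3D".
reencoded = quopri.encodestring(decoded)
print(reencoded.decode("ascii"))
```

The decoded text is ordinary HTML with href="..." attributes, which is why 3D"URL" never needs to be interpreted on its own.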
For example, if I want the bullet point character in my HTML page, I could either type out &#8226; or just copy and paste •. What's the real difference?
&#8226; is a sequence of 7 ASCII characters: ampersand (&), number sign (#), eight (8), two (2), two (2), six (6), semicolon (;).
• is 1 single bullet point character.
That is the most obvious difference.
The former is not a bullet point. It's a string of characters that an HTML browser would parse to produce the final bullet point that is rendered to the user. You will always be looking at this string of ASCII characters whenever you look at your HTML's source code.
The latter is exactly the bullet point character that you want, and it's clear and precise to understand when you look at it.
Now, &#8226; uses only ASCII characters, so the file it is in can be encoded using pure ASCII, or any compatible encoding. Since ASCII is the de facto basis of virtually all common encodings, you don't need to worry much about the file encoding; you can blissfully ignore that part of working with text files and you'll probably never run into any issues.
However, &#8226; is only meaningful in HTML. It remains just a string of ASCII characters in the context of a database, a plain-text email, or any other non-HTML situation.
•, on the other hand, is not a character that can be encoded in ASCII, so you need to consciously choose an encoding which can represent that character (like UTF-8), and you need to ensure that you're sending the correct metadata to ensure that clients interpret the encoding correctly as well (HTTP headers, HTML <meta> tags, etc). See UTF-8 all the way through.
But • means • in any context, plain-text or otherwise, and does not need to be specifically HTML-interpreted.
Also see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
A character reference such as &#8226; works independently of the document encoding. It takes up more octets in the source (here: 7).
A character such as • works only with the precise encoding declared with the document. It takes up fewer octets in the source (here: 3, assuming UTF-8).
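The octet counts above can be checked with a short sketch. It uses Python's html.unescape as a stand-in for what an HTML parser does with the reference; the variable names are only for illustration.

```python
import html

reference = "&#8226;"               # what you type in the HTML source
literal = html.unescape(reference)  # what the parser turns it into: one character

print(len(reference.encode("ascii")))  # octets used by the reference
print(len(literal.encode("utf-8")))    # octets used by the literal bullet in UTF-8

# The literal bullet cannot be stored in a pure ASCII file at all.
try:
    literal.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
```

Outside of an HTML parser nothing performs the unescape step, which is why the reference stays a plain string of seven ASCII characters in a database or a plain-text email.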
The JSON specification states that the only control characters that must be escaped are those with codes from U+0000 to U+001F:
7. Strings
The representation of strings is similar to conventions used in the C
family of programming languages. A string begins and ends with
quotation marks. All Unicode characters may be placed within the
quotation marks, except for the characters that must be escaped:
quotation mark, reverse solidus, and the control characters (U+0000
through U+001F).
The main idea of escaping is to avoid damaging the output when printing a JSON document or message on a terminal or on paper.
But there are other control characters, like [DEL] from C0 and the control characters from the C1 set (U+0080 through U+009F). Shouldn't they also be escaped in JSON strings?
From the JSON specification:
8. String and Character Issues
8.1. Character Encoding
JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32.
In UTF-8, all code points above 127 are encoded as multiple bytes, and about half of the possible continuation byte values fall in the C1 control character range (0x80 through 0x9F). So in order to avoid having those bytes in a UTF-8 encoded JSON string, all of those code points would need to be escaped. This effectively eliminates the use of UTF-8, and the JSON string might as well be encoded in ASCII. As ASCII is a subset of UTF-8, this is not disallowed by the standard. So if you are concerned about putting C1 control characters in the byte stream, just escape them, but requiring every JSON representation to use ASCII would be wildly inefficient in anything but an English environment.
UTF-16 and UTF-32 could not possibly be parsed by something that uses the C1 (or even C0) control characters so the point is rather moot for those encodings.
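To make this concrete, here is a small sketch using Python's standard json module. The sample string is made up; ensure_ascii is the module's own switch for escaping everything outside ASCII, which goes beyond what the specification requires.

```python
import json

text = "a\u0085b"  # U+0085 (NEL) is a C1 control character

# Allowed by the specification: the C1 character is written as-is.
raw = json.dumps(text, ensure_ascii=False)
raw_bytes = raw.encode("utf-8")

# Optional: escape everything outside ASCII, so the output is pure ASCII.
escaped = json.dumps(text, ensure_ascii=True)
escaped_bytes = escaped.encode("utf-8")

# Both forms decode to the same string.
assert json.loads(raw) == json.loads(escaped) == text

# An ordinary letter also produces a byte in the 0x80-0x9F range in UTF-8:
# U+00C5 (A with ring above) is encoded as C3 85.
letter_bytes = "\u00c5".encode("utf-8")
print(raw_bytes, escaped_bytes, letter_bytes)
```

The last line shows why avoiding bytes in the C1 range means escaping far more than the C1 characters themselves.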
I want a simple, light-weight way for two basic 8-bit MCUs to talk to each other over an 8-bit UART connection, sending both ASCII characters as 8-bit values, and binary data as 8-bit values.
I would rather not re-invent the wheel, so I'm wondering if some ASCII implementation would work, using ASCII control characters in some standard way.
The problem: either I'm not understanding it correctly, or it's not capable of doing what I want.
The Wikipedia page on control characters says a packet could be sent like this:
< DLE > < SOH > - data link escape and start of heading
Heading data
< DLE > < STX > - data link escape and start of text
< payload >
< DLE > < ETX > - data link escape and end of text
But what if the payload is binary data containing two consecutive bytes equivalent to DLE and ETX? How should those bytes be escaped?
The link may be broken and re-established, so a receiving MCU should be able to start receiving mid-packet, and have a simple way of telling when the next packet has begun, so it can ignore data until the end of that partial packet.
Error checking will happen at a higher level to ensure that a received packet is valid, unless ASCII standards can solve this too.
Since you are going to transfer binary data along with text messages, you indeed have to make sure the receiver won't confuse control bytes with payload contents. One way to do that is to encode the payload data so that none of the special characters appear in the output. If the overhead is not a problem, then a simple encoding like Base16 should be enough. Otherwise, you may want to take a look at escapeless encodings that have been specifically designed to remove certain characters from encoded data.
I understand this is an old question but I thought I should suggest Serial Line Internet Protocol (SLIP) which is defined in RFC 1055. It is a very simple protocol.
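A rough sketch of SLIP framing, written in Python for readability (on an MCU it would be the same logic in C). The four byte values are the ones defined in RFC 1055; the function names and the rule of discarding input until the first END byte are my own assumptions, added to cover the "start receiving mid-packet" requirement.

```python
END, ESC, ESC_END, ESC_ESC = 0xC0, 0xDB, 0xDC, 0xDD  # values from RFC 1055

def slip_encode(payload: bytes) -> bytes:
    """Wrap one payload in END bytes and escape END/ESC inside it."""
    out = bytearray([END])  # a leading END lets the receiver drop earlier noise
    for b in payload:
        if b == END:
            out += bytes([ESC, ESC_END])
        elif b == ESC:
            out += bytes([ESC, ESC_ESC])
        else:
            out.append(b)
    out.append(END)
    return bytes(out)

def slip_decode_stream(stream: bytes) -> list:
    """Return the complete packets found in a byte stream."""
    packets = []
    current = bytearray()
    synced = False   # assumption: ignore everything before the first END
    escaped = False
    for b in stream:
        if b == END:
            if synced and current:
                packets.append(bytes(current))
            current = bytearray()
            synced = True
            escaped = False
        elif not synced:
            continue  # tail of a packet we joined in the middle of
        elif escaped:
            current.append(END if b == ESC_END else ESC if b == ESC_ESC else b)
            escaped = False
        elif b == ESC:
            escaped = True
        else:
            current.append(b)
    return packets

payload = bytes([0x10, 0x03, 0xC0, 0xDB, 0x41])  # contains DLE, ETX, END and ESC
frame = slip_encode(payload)
received = slip_decode_stream(b"\x99\x98" + frame + slip_encode(b"hi"))
```

Because END never appears inside an encoded payload, the receiver can resynchronise on the next END no matter where it joined the stream. SLIP itself has no checksum, so validity still has to be checked at a higher level.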
I have problems with reading text from an external XML.
Flash doesn't seem to have a problem with ASCII characters (32-127), but it isn't able to show extended characters (128-255).
In that XML I have, for example, „ (DEC: 132) and “ (DEC: 147).
In the XML those characters are not visible, but they are still there. Flash isn't able to show them. My approach was to get each charCode and convert it to a string, but that only works for printable characters.
var textToConvert:String = xml.parameters.text[1].value;
trace("LENGTH = " + textToConvert.length);
var test:String = "";
// Start at index 0 so the first character is not skipped.
for (var i:int = 0; i < textToConvert.length; i++) {
    var code:Number = textToConvert.charCodeAt(i);
    // DEC
    trace(code);
    // OCT
    trace(code.toString(8));
    // HEX
    trace(code.toString(16));
    // Collect the hex values of all characters
    test += code.toString(16);
    trace("SYMBOL : " + String.fromCharCode(code));
}
trace("TEST: " + test);
Result:
76
114
4c
SYMBOL : L
132
204
84
SYMBOL : (Not Visible)
The next thing I tried was to prefix each HEX value with the escape sequence "\x" and then convert it to a String, but that doesn't work either:
var s:String = "\x93\x93\x84\x93\x84";
var ba:ByteArray = new ByteArray();
ba.writeMultiByte(s, "ASCII");
trace(s);
This was my first approach (not working):
var byteArray:ByteArray = new ByteArray();
byteArray.writeMultiByte(textToConvert,"iso-8859-1");
trace("HIER: "+byteArray.readUTFBytes(byteArray.bytesAvailable));
What would be a universal approach to solve this problem?
This is the XML; it has hidden ASCII characters (quotes). I want to parse the values of the nodes including those characters:
XML-DL
Internally AS3 strings are encoded as 16-bit Unicode. They support your characters. It has also decoded it correctly as it has read the correct char code.
Does the font used for output have a glyph capable of rendering it? This applies even to the AS3 console. Your char isn't "empty", it just can't draw it. If you changed your trace to include quotes either side of the character you would see it writes the empty space still.
If you dump it to a TextField instead using a font you know has the correct support then it should work as expected.
If this doesn't meet your needs then you will need to do some kind of conversion. There is no generally accepted library to do this, as it depends on your needs. What should be done with single chars that typically need multiple chars to represent them? ø is normally translated to 'oe', but that may not be suitable in a fixed-length string. There isn't an equivalent for most Hebrew, Cyrillic, Arabic etc. letters. What rules do you want to apply to those? You need to decide what you need and then do a conversion that matches those requirements (or pick a library that does).
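One possible conversion, sketched in Python rather than ActionScript: the codes 132 and 147 from the question are the positions of „ and “ in Windows-1252. That the XML file was saved as Windows-1252 but read as ISO-8859-1 is purely my assumption; the question does not state the file's encoding.

```python
raw = bytes([0x84, 0x93])  # the two codes from the question: 132 and 147

# Read as ISO-8859-1, they become U+0084 and U+0093, invisible C1 controls.
as_latin1 = raw.decode("latin-1")

# Read as Windows-1252, the same bytes are the two quote characters.
as_cp1252 = raw.decode("cp1252")

# Repairing a string that was already decoded with the wrong charset:
repaired = as_latin1.encode("latin-1").decode("cp1252")
print([hex(ord(c)) for c in as_latin1], as_cp1252, repaired)
```

If the assumption holds, decoding the file with the right charset in the first place avoids the need for any repair step.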
I've written a program to perform run length encoding.
In a typical scenario, if the text is
AAAAAABBCDEEEEGGHJ
run length encoding will make it
A6B2C1D1E4G2H1J1
but it was adding an extra 1 for each non-repeating character. Since I'm compressing BMP files with it, I went with the idea of placing a marker "$" to signify the occurrence of a repeating character (assuming that image files have a huge amount of repeating text).
So it'd look like
$A6$B2CD$E4$G2HJ
For the current example its length is the same, but there's a noticeable difference for BMP files. Now my problem is in decoding. It so happens that some BMP files contain the pattern $<char><num>, e.g. $I9, in the original file, so the compressed file contains the same text, $I9. Upon decoding, however, it is treated as an I repeated 9 times, which produces wrong output. What I want to know is which symbol I can use to mark the start of a repeating character (run) so that it doesn't conflict with the original source.
Why don't you encode each $ in the original file as $$ in the compressed file?
And/or use some other character instead of $, one that is not used much in BMP files.
Also note that the BMP format has RLE compression 'built-in' - look here, near the bottom of the page - under "Image Data and Compression".
I don't know what you're using your program for, or if it's just for learning, but if you used the "official" BMP method, your compressed images wouldn't need decompression before viewing.
AAAAAABBCDEEEEGGHJ$IIIIIIIII ==> $A6$B2CD$E4$G2HJ$$I9
If the repeat character occurs in the data, try inserting an extra repeat character in the encoded data. Then, if the decoder sees a double repeat character, it can insert the actual repeat character:
$A6$B2CD$E4$G2HJ$$I9 ==> AAAAAABBCDEEEEGGHJ$IIIIIIIII
What most programs do to signify that some character needs to be treated literally is that they have a defined escape sequence.
For example, in regular expressions, the following are specially defined characters that usually have a meaning:
^[].*+{}()$
Yes, your fun dollar sign character is in there, and it usually means end of line.
So what a programmer using regular expressions has to do to have these characters interpreted literally is that they need to express those characters as an escape sequence. For example, to interpret $ as $, and not end of line, the programmer uses \$, which is the escape sequence.(1)
In your case, you can store literal dollar signs into your compressed file as \$.(2)
NB: grep inverts this logic.
The above solutions to store $ as $$ become confusing when you have runs of $ in the BMP file.
If you have the luxury of being able to scan the entire input before starting to compress it, you could choose the least frequent value in the input as your escape value.
For example, given this input:
AAAABBCCCCDDEEEEEEEFFG
You could choose "G" as your escape value (or even "H" if it's part of your symbol set) and adopt a convention whereby the first character of the encoded stream is the escape value. So the string above might encode to:
GGA4BBGC4DDGE7FFGG
or even better:
HHA4BBHC4DDHE7FFG
Please note that there's no point in encoding a "run" of two identical values because the "compressed" version (e.g. HD2) is longer than the uncompressed version (DD).
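Here is a rough sketch of that convention in Python, kept text-based to match the example. The function names, the alphabet parameter, and the limit of one decimal digit per run (so runs longer than 9 are split) are my own assumptions; a literal occurrence of the escape value is written twice, as in the "G" variant above.

```python
from collections import Counter
from itertools import groupby

def choose_escape(data: str, alphabet: str) -> str:
    """Pick the least frequent symbol of the alphabet as the escape value."""
    counts = Counter(data)
    # min() keeps the first minimum, so ties go to the earliest symbol.
    return min(alphabet, key=lambda symbol: counts[symbol])

def rle_encode(data: str, alphabet: str) -> str:
    esc = choose_escape(data, alphabet)
    out = [esc]  # the first symbol of the stream announces the escape value
    for symbol, group in groupby(data):
        n = len(list(group))
        if symbol == esc:
            out.append(esc * 2 * n)  # literal escape values are doubled
            continue
        while n >= 3:                # runs of 1 or 2 are cheaper left alone
            chunk = min(n, 9)        # assumption: a single decimal digit
            out.append(f"{esc}{symbol}{chunk}")
            n -= chunk
        out.append(symbol * n)
    return "".join(out)

def rle_decode(encoded: str) -> str:
    esc, body = encoded[0], encoded[1:]
    out = []
    i = 0
    while i < len(body):
        if body[i] != esc:
            out.append(body[i])
            i += 1
        elif body[i + 1] == esc:     # doubled escape: one literal escape value
            out.append(esc)
            i += 2
        else:                        # escape, symbol, digit
            out.append(body[i + 1] * int(body[i + 2]))
            i += 3
    return "".join(out)

sample = "AAAABBCCCCDDEEEEEEEFFG"
with_h = rle_encode(sample, "ABCDEFGH")
with_g = rle_encode(sample, "ABCDEFG")
```

With "H" in the symbol set the escape value never occurs in the data, so nothing has to be doubled; with only "A" to "G" available, the single "G" costs one extra symbol.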
Hope that helps!
If I understand correctly, the problem is that $ is both a symbol for marking a repeat, and also can be a 'BMP' value as well?
If so, what you could do is to mark a double $ ('$$') character to denote that the '$' character should be treated not as a repeat, but as a single '$'. This would of course mean that the '$' is expensive to encode (takes two symbols instead of 1), but would solve your problem.
If you wanted to have a run of the '$' character, you would need to encode it as:
$$$5 - meaning: the first '$' marks a run, '$$' is the escaped literal '$', and '5' is the repeat count (5 times).
I'm honestly not sure what would possess someone to use a text-based RLE if they want to compress binary data with it. A BMP is not text.
Right now, since only a single byte is read for the count, and it is interpreted as an ASCII digit from 0 to 9, this process has a run length range of 0 to 9, meaning you can only compress up to 9 repetitions before a new run-length flag needs to be written. After all, you can't tell the difference between $I34 for a run length of 34, and $I3 + 4 for a literal 4 behind the repeat of 3.
If this same byte is instead interpreted as binary value, it can contain values from 0 to 255, giving a massive difference in efficiency.
As for the escaping of $ signs themselves, I'd advise either always treating it as a repeat of at least 1 ($$1), or, better yet, encoding the entire thing differently, with the order of the run length values and the data swapped, so a code becomes $<length><data>; then you can use $0 as a special symbol to mean 'just $'. When decompressing and encountering the 0 after a $, simply don't read on for a third byte. A run length of 0 should never appear in the compressed data anyway, so it can be given a special meaning, but this is useless if the data byte is put first, since then it'd still be the same length as a normal repeat.
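A minimal sketch of that $<length><data> layout, in Python and operating on bytes. The function names and the thresholds for when a run is worth encoding are my own choices, not part of the scheme described above.

```python
MARKER = ord("$")

def rle_encode_bytes(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        value = data[i]
        run = 1
        # The length is stored in one binary byte, so a run is capped at 255.
        while i + run < len(data) and data[i + run] == value and run < 255:
            run += 1
        # A run code costs 3 bytes; a literal '$' costs 2 bytes ($ 0), so a
        # run of '$' pays off from length 2, other values from length 4.
        threshold = 2 if value == MARKER else 4
        if run >= threshold:
            out += bytes([MARKER, run, value])
        elif value == MARKER:
            out += bytes([MARKER, 0]) * run   # "$0" means 'just $'
        else:
            out += bytes([value]) * run
        i += run
    return bytes(out)

def rle_decode_bytes(encoded: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] != MARKER:
            out.append(encoded[i])
            i += 1
        elif encoded[i + 1] == 0:             # "$0": literal '$', no third byte
            out.append(MARKER)
            i += 2
        else:                                 # "$", length, data byte
            out += bytes([encoded[i + 2]]) * encoded[i + 1]
            i += 3
    return bytes(out)

original = b"AAAAAABBCDEEEEGGHJ$" + b"I" * 9
packed = rle_encode_bytes(original)
tricky = rle_encode_bytes(b"$I9")  # the pattern that broke the text-based version
```

Since the length byte is read as a binary value, the decoder never has to guess where a count ends, and the $I9 pattern from the question round-trips as ordinary data.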