I am a little confused about the encoding issues related to HTML. I am not referring to the charset in the headers or the encoding in the XML prologue; that I get. Let me explain.
When "mailto:" is used with an anchor or a submit button in a form, white space is encoded as "%20" and "line feed/carriage return/new line/end of line" is encoded as "%0A". However, when the enctype attribute is used on a form with a value of "application/x-www-form-urlencoded", white space is encoded as "+", and special characters, apostrophes, percent signs and other symbols are converted to their ASCII hex equivalents. Is the value "application/x-www-form-urlencoded" a URL encoding? If so, why "%20" for the first one and "+" for the second?
"mailto:someone@someplace.com?cc=carbon@copy.com&bcc=blind@carbobcopy.org&subject=This%20is%20the%20subject&body=This%20is%20the%20body%0AThis%20is%20the%20second%20paragraph"
In the above example, white space in the subject is encoded as %20 and the new line in the body is encoded as %0A.
<form enctype="application/x-www-form-urlencoded"></form>
And in the form above, white space will be encoded as "+". Am I missing something?
Thanks in advance.
URIs (like your mailto example) should be encoded according to RFC 3986, which specifies that spaces are to be encoded as %20.
FORM data, on the other hand, is encoded as application/x-www-form-urlencoded according to the rules defined by the HTML specification. (See, for example, section 17.13.3.3 of the HTML 4.01 specification.) This specifies that spaces are to be replaced by + signs.
Thus, while percent encoding is similar between URIs and form data, the space character is treated differently.
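To make the difference concrete, here is a minimal sketch using Python's standard urllib.parse module. The sample strings are made up for illustration; quote() follows the URI-style rules and quote_plus()/urlencode() follow the form-data rules.

```python
from urllib.parse import quote, quote_plus, urlencode

text = "This is the subject"

# URI-style percent-encoding: a space becomes %20.
uri_style = quote(text)

# application/x-www-form-urlencoded style: a space becomes +.
form_style = quote_plus(text)

# A line feed is percent-encoded as %0A.
body = quote("first line\nsecond line")

# urlencode() builds a form-style query string from key/value pairs.
form_query = urlencode({"subject": text})

print(uri_style)   # This%20is%20the%20subject
print(form_style)  # This+is+the+subject
print(body)        # first%20line%0Asecond%20line
print(form_query)  # subject=This+is+the+subject
```

For a mailto: link you would want the quote() form, since mail clients generally do not treat "+" as a space.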
I have the following code
<td colspan="#missingGridColumnCount">**&nbsp;<span translate="MissingItems">.MissingInstruments</span>&nbsp;**</td>
This prints correctly through the browser but when I print to my Zebra printer, I get the following on the label:
**_áMissing Items_á**
I have looked through the Zebra label documentation but cannot find a way to convert this, or to make the labels accept the &nbsp; entity.
This is a character encoding issue.
The probable chain of events is this:
The browser is rendering the &nbsp; entity into the Unicode code point "U+00A0 NO-BREAK SPACE".
This is being encoded in UTF-8, as the sequence of bytes C2 A0.
These bytes are being interpreted by the Zebra printer according to Code page 850, where C2 is mapped to "┬" (U+252C BOX DRAWINGS LIGHT DOWN AND HORIZONTAL) and A0 to "á" (U+00E1 LATIN SMALL LETTER A WITH ACUTE).
In code page 850, a non-breaking space is represented by the byte FF.
You may be able to tell whatever is interpreting the HTML to use Code page 850 instead of UTF-8, so that it sends the byte sequences the printer is expecting. You will need to make sure your input doesn't contain any literal UTF-8: escape all non-ASCII characters as HTML entities.
Otherwise, you will need to substitute byte-wise before sending to the printer, or encode in some other way.
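The chain of events above can be reproduced with Python's built-in codecs. This is only a sketch of the suspected mechanism, not of what the Zebra driver actually does; the helper name to_cp850 is hypothetical.

```python
nbsp = "\u00a0"  # NO-BREAK SPACE

# What gets sent when the text is encoded as UTF-8.
utf8_bytes = nbsp.encode("utf-8")

# What a device reading those bytes as Code page 850 would display:
# two characters, the second of which is "á".
misread = utf8_bytes.decode("cp850")

# The single byte Code page 850 itself uses for a non-breaking space.
expected_by_printer = nbsp.encode("cp850")

print(utf8_bytes.hex())           # c2a0
print(misread)
print(expected_by_printer.hex())  # ff


def to_cp850(text: str) -> bytes:
    """Hypothetical helper: re-encode label text for a cp850 device.

    Characters that cp850 cannot represent are replaced with "?".
    """
    return text.encode("cp850", errors="replace")
```

Passing the label text through something like to_cp850() before sending it would be one way to do the byte-wise substitution mentioned above.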
As far as I know, URL encoding exists because URLs only support ASCII encoding. But since " is already in the ASCII table, why should it be encoded as %22 in URL encoding?
The " character falls under section 2.2 (URL Character Encoding Issues) of RFC 1738 (Uniform Resource Locators), under the "Unsafe" section. The reason for the inclusion is:
The quote mark (""") is used to delimit URLs in some systems.
One case of this that I can think of is an HTML attribute. For example, if you have an <a> tag with an href attribute, you will likely enclose the URL in double quotes. If a " character inside the URL is not encoded, it ends the attribute early and the tag becomes invalid:
...
The RFC also proceeds to say:
All unsafe characters must always be encoded within a URL.
Some examples of other unsafe characters:
The characters "<" and ">" are unsafe because they are used as the delimiters around URLs in free text.
The character "%" is unsafe because it is used for encodings of other characters.
The character "#" is unsafe and should always be encoded because it is used in World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it.
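As a quick illustration of why these characters are encoded, the sketch below percent-encodes a made-up value containing several of them before placing it in an href attribute. The URL http://example.com/ and the sample string are assumptions for the example only.

```python
from urllib.parse import quote, unquote

raw = 'say "hi" <now> 100% #1'

# safe="" asks quote() to encode every reserved or unsafe character.
encoded = quote(raw, safe="")

# The encoded value contains no double quote, so it cannot
# terminate the href attribute early.
link = f'<a href="http://example.com/?q={encoded}">link</a>'

print(encoded)  # say%20%22hi%22%20%3Cnow%3E%20100%25%20%231
print(link)

# Decoding restores the original text.
assert unquote(encoded) == raw
```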
URLs only support ASCII encoding
That's not true. URLs don't support a literal space, /, & or ? as data, for example, even though these are valid ASCII characters, because they have special meaning in URLs or in the protocols that carry them.
Characters that never need encoding in URLs are:
A-Z
a-z
0-9
-
_
.
~
Other characters have to be percent-encoded when they are used as data. Some, such as spaces and tabs, are not supported because they have special meaning in protocols that usually use URLs, such as HTTP. Others, such as ? and &, are not supported because they have special meaning in URL syntax.
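A small check of the list above, sketched with Python's urllib.parse.quote (Python 3.7 or later is assumed, where "~" is treated as unreserved). The sample data string is made up.

```python
import string
from urllib.parse import quote

# The characters listed above: letters, digits, and - _ . ~
unreserved = string.ascii_letters + string.digits + "-_.~"

# With safe="" nothing is exempted, yet these pass through unchanged.
untouched = quote(unreserved, safe="")

# Everything else is percent-encoded when it is used as data.
data = quote("a b/c?d&e", safe="")
tab = quote("\t", safe="")

print(untouched == unreserved)  # True
print(data)                     # a%20b%2Fc%3Fd%26e
print(tab)                      # %09
```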
In JSP file, the source code is
|&#x31;&#x80;&#x33;|<%="\u0031\u0080\u0033" %>|
The result on the page is:
|1€3|13|
Why is the Euro symbol represented differently ?
The HTML numerical character references in the range 0x80–0x9F don't actually correspond to the characters U+0080–U+009F. Instead, they refer to the characters mapped into the bytes 0x80–0x9F from the windows-1252 encoding.
This is a weird historical artefact from the days before browsers did Unicode. HTML5 sort-of standardises it, in that although it's invalid, parsers are required to parse it this way. This does not happen in XML/XHTML.
So \u0080 gives you the actual character U+0080, which you can't see because it's an invisible control character, but &#x80; gives you code page 1252 byte 0x80, which is U+20AC Euro Sign.
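Python's html.unescape() applies the same HTML5 parsing rule, so it can be used to see the difference. This is a sketch that assumes the page used hexadecimal references such as &#x80; for the first value.

```python
import html

# Numeric character references, resolved the way an HTML5 parser does:
# 0x80 is remapped through windows-1252 to the euro sign.
from_reference = html.unescape("&#x31;&#x80;&#x33;")

# A string escape, by contrast, yields the real code point U+0080,
# an invisible C1 control character.
from_escape = "\u0031\u0080\u0033"

# The remapping matches decoding byte 0x80 as windows-1252.
cp1252_char = bytes([0x80]).decode("cp1252")

print(from_reference)                         # 1€3
print([hex(ord(c)) for c in from_escape])     # ['0x31', '0x80', '0x33']
print(cp1252_char == "\u20ac")                # True
```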
I am having an issue with characters in my XML when I view it on a website. The character I want to be put in is § and what is coming out is Â§ and my xml is <?xml version="1.0" encoding="UTF-8"?>. Any suggestions? Thanks!!
If you see “§” as “Â§”, then the reason is usually that the data contains “§” SECTION SIGN U+00A7 as UTF-8 encoded, as bytes 0xC2 0xA7, but it is being misinterpreted as being in an 8-bit encoding like windows-1252 or ISO-8859-1. Alternatively, an incorrect character code conversion (“double UTF-8 encoding”) has taken place.
Check out the HTTP headers of the web page. If they declare an encoding other than UTF-8, they may override the in-document declaration.
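Both failure modes described above can be reproduced in a few lines of Python. This is only a sketch of the mechanism; it does not tell you which of the two is happening on your server.

```python
section = "§"  # U+00A7 SECTION SIGN

# Correct UTF-8 encoding: two bytes.
utf8_bytes = section.encode("utf-8")

# Misinterpretation: the same bytes read as windows-1252.
mojibake = utf8_bytes.decode("cp1252")

# "Double UTF-8 encoding": the misread text is encoded as UTF-8 again.
double_encoded = mojibake.encode("utf-8")

# If only the decoding step was wrong, it can be reversed.
repaired = mojibake.encode("cp1252").decode("utf-8")

print(utf8_bytes)      # b'\xc2\xa7'
print(mojibake)        # Â§
print(double_encoded)  # b'\xc3\x82\xc2\xa7'
print(repaired)        # §
```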
Instead of the character § you can use its HTML code, which is either &sect; or &#167;.
Have a look at Ascii Code; every symbol listed there has a dedicated HTML code that can be used instead of the symbol,
like the non-breaking space, which I am sure you are familiar with: &nbsp;
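If you want to generate such codes rather than look them up, Python's standard library can do it. A minimal sketch, with the sample string made up for illustration:

```python
import html
from html.entities import codepoint2name

ch = "§"

# Numeric character reference built from the code point.
numeric = f"&#{ord(ch)};"

# Named entity, looked up in the standard HTML entity table.
named = f"&{codepoint2name[ord(ch)]};"

# Replace every non-ASCII character in a string with a numeric reference.
ascii_only = "§ 5".encode("ascii", "xmlcharrefreplace").decode("ascii")

print(numeric)     # &#167;
print(named)       # &sect;
print(ascii_only)  # &#167; 5

# Both forms decode back to the same character.
assert html.unescape(numeric) == html.unescape(named) == ch
```

Note that named entities such as &sect; are defined for HTML; in plain XML only the numeric form is available unless the entity is declared.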
I have a normal HTML form whose action is http://another-site.com.
My website (http://my-site.com) is encoded in UTF-8, but another-site is encoded in GBK.
The problem is that when I submit my form from my-site.com and the page forwards to another-site.com, which is encoded in GBK as I mentioned, the page's characters are completely garbled.
Is this my problem? How do I tell the browser to use GBK for another-site.com?
NOTE: Both another-site.com and my-site.com set Content-Type with their respective encodings.