HTML Character Encoding - html

When outputting HTML content from a database, some encoded characters are being properly interpreted by the browser while others are not.
For example, %20 properly becomes a space, but %AE does not become the registered trademark symbol.
Am I missing some sort of content encoding specifier?
(note: I cannot realistically change the content to, for example, &reg; as I do not have control over the input editor's generated markup)

%AE is not valid for HTML safe ASCII.
You can view the table here: http://www.ascii.cl/htmlcodes.htm
It looks like you are dealing with a Windows/Word encoding (windows-1252 or something like that). It will NOT convert to HTML-safe characters unless you do some sort of translation in the middle.

The byte AE is the ISO-8859-1 representation of the registered trademark sign. If you don't see anything, then apparently the URL decoder is using a different charset to URL-decode it. In UTF-8, for example, a lone AE byte does not represent any valid character.
To fix this, you need to URL-decode it using ISO-8859-1, or convert the existing data to be URL-encoded using UTF-8.
That said, you should not confuse HTML (XML) encoding like &reg; with URL encoding like %AE.
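To make the charset dependence concrete, here is a minimal sketch in Python (the question does not say which server-side language is in use, so Python is only an assumption for illustration). It decodes the same %AE escape with two different charsets and then shows what the UTF-8 form of the escape looks like.

```python
from urllib.parse import quote, unquote

# %AE stands for the single byte 0xAE; the result depends on the charset.
as_latin1 = unquote("%AE", encoding="iso-8859-1")
as_utf8 = unquote("%AE", encoding="utf-8", errors="replace")

print(repr(as_latin1))  # the registered sign, U+00AE
print(repr(as_utf8))    # U+FFFD, because a lone 0xAE is not valid UTF-8

# The same character URL-encoded as UTF-8 takes two bytes.
utf8_escaped = quote("\u00ae", encoding="utf-8")
print(utf8_escaped)     # %C2%AE
```

So either the decoder has to be told the data is ISO-8859-1, or the stored data has to be re-encoded so that it carries %C2%AE instead of %AE.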

The '%20' encoding is URL encoding. It's only useful for URLs, not for displaying HTML.
If you want to display the reg character in an HTML page, you have two options: Either use an HTML entity, or transmit your page as UTF-8.
If you do decide to use the entity code, it's fairly simple to convert them en masse, since you can use numeric entities; you don't have to use the named entities, i.e. use &#174; rather than &reg;.
If you need to know entity codes for every character, I find this cheat-sheet very helpful: http://www.evotech.net/blog/2007/04/named-html-entities-in-numeric-order/
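The en-masse conversion to numeric entities can be sketched like this in Python (an assumption, since the asker's language is unknown; the helper name to_numeric_entities is made up for this example). It relies on the standard xmlcharrefreplace error handler, which swaps every non-ASCII character for its decimal reference.

```python
def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    # Encoding to ASCII fails for characters such as the registered sign;
    # the xmlcharrefreplace handler substitutes &#NNN; for each of them.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


converted = to_numeric_entities("Acme\u00ae costs \u00a35")
print(converted)  # Acme&#174; costs &#163;5
```

Note that this only touches non-ASCII characters. It does not escape &, < or >, so it is not a replacement for normal HTML escaping.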

What server side language are you using? Check for a URL Decode function.

If you are using php you can use urldecode() but you should be careful about + characters.
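The + caveat is not specific to PHP. As a rough illustration in Python (used here only as an assumption, the input string is made up), unquote leaves a literal + alone while unquote_plus treats it as a space, which is the form-style behaviour that PHP's urldecode() also has.

```python
from urllib.parse import unquote, unquote_plus

raw = "C%2B%2B+and%20%C2%AE"

# Plain percent-decoding: the unescaped '+' stays a plus sign.
strict = unquote(raw)

# Form-style decoding: the unescaped '+' becomes a space,
# while the escaped %2B still decodes to a real plus sign.
form_style = unquote_plus(raw)

print(strict)      # C+++and ®
print(form_style)  # C++ and ®
```

Which of the two is right depends on how the data was encoded in the first place, so it is worth checking a sample that contains a real plus sign.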

Related

£ getting converted to ? by HTML Tidy, EncodingType?

I am cleaning a HTML file using HTML Tidy, well the .NET version called TidyManaged, and my "£" symbols are being converted to "?"
ie:
Income (£)
becomes:
Income (�)
I believe it is to do with encoding types. In TidyManaged, one can specify the input encoding type and output encoding type, including such things as Latin1, utf8, utf16, win1252.
The XHTML doc will ultimately get converted into a DOC, which uses win1252.
So what should my input and output encoding be to preserve £ symbols?
Many thanks.
Well, whenever I've used other charsets the result has been different. I'm not fluent in them, but I do know that to create symbols and punctuation you sometimes need to use a 'code' rather than the literal character. I've never seen win1252, but Google says the pound sign is 0x00A3.
Try putting that somewhere in your document.
I know in HTML I would put a character reference such as &pound; for a pound sign. So HTML:
<p>&pound;0.00</p>
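For what it's worth, the "?" symptom is what you get when the declared input encoding does not match the bytes actually in the file. The sketch below is plain Python and does not use TidyManaged at all, so it says nothing about which TidyManaged settings to choose. It only shows how the same pound sign looks under matching and mismatched encodings.

```python
pound = "\u00a3"  # the pound sign

utf8_bytes = pound.encode("utf-8")      # two bytes: C2 A3
cp1252_bytes = pound.encode("cp1252")   # one byte:  A3

# A windows-1252 file read as UTF-8: the lone A3 byte is invalid,
# so a lenient decoder substitutes the replacement character.
misread_as_utf8 = cp1252_bytes.decode("utf-8", errors="replace")

# A UTF-8 file read as windows-1252: each byte becomes its own character.
misread_as_cp1252 = utf8_bytes.decode("cp1252")

print(repr(misread_as_utf8))    # '\ufffd'
print(repr(misread_as_cp1252))  # 'Â£'
```

In other words, the input encoding has to describe the file as it really is on disk; the output encoding can then be whatever the next step in the pipeline expects.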

HTML files with no http-equiv meta tag and the charset may be other than UTF-8

We are using jsoup. It's excellent, thanks.
We may get HTML files with no http-equiv meta tag and the charset may be other than UTF-8.
How is it best to handle this, please? We could keep a list of encodings and try them, but I am not sure how to tell programmatically if something is wrong. Would jsoup throw an IOException?
Jsoup will try to determine the encoding from the Content-Type header or the http-equiv tag; if you have neither, it will use UTF-8. I'm not sure whether jsoup can do more for you here.
But you can try another approach:
Implement a class that reads the files for you. There you can take care of all encoding issues. Such a class should give you a properly decoded string, or at least the encoding that's used for your input.
(html input) --> [encoding class] --normalized encoding--> [jsoup] --> (whatever)
Jsoup can now parse that input with a known encoding.
I guess changing whatever creates the HTML is not possible, is it?
Some further reading:
http://illegalargumentexception.blogspot.co.uk/2009/05/java-rough-guide-to-character-encoding.html#javaencoding_autodetect
Character Encoding Detection Algorithm
What is the most accurate encoding detector? (includes a list of implementation)
Java Text File Encoding
Detect (or best guess of) incoming string encoding in Java
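The "list of encodings and try them" idea from the question can be sketched as follows. This is Python rather than Java and has nothing to do with jsoup's own API; the function name and the default candidate list are assumptions made for the example. The point is that a strict decoder signals a wrong guess by raising an error, which answers the "how do I tell programmatically" part.

```python
from typing import Iterable, Tuple


def decode_with_fallbacks(
    data: bytes,
    candidates: Iterable[str] = ("utf-8", "cp1252", "iso-8859-1"),
) -> Tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for name in candidates:
        try:
            # Strict decoding raises UnicodeDecodeError on invalid bytes.
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")


print(decode_with_fallbacks("\u00a3100".encode("utf-8")))  # decodes as utf-8
print(decode_with_fallbacks(b"\xa3100"))                   # falls back to cp1252
```

Order matters: UTF-8 goes first because random non-UTF-8 input rarely passes as valid UTF-8, and ISO-8859-1 goes last because it accepts every byte and so never fails. A clean decode only means the bytes are legal in that encoding, not that the guess is right, so this stays a best guess.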

Why is "&reg" being rendered as "®" without the bounding semicolon

I've been running into a problem that was revealed through our Google adwords-driven marketing campaign. One of the standard parameters used is "region". When a user searches and clicks on a sponsored link, Google generates a long URL to track the click and sends a bunch of stuff along in the referrer. We capture this for our records, and we've noticed that the "Region" parameter is coming through incorrectly. What should be
http://ravercats.com/meow?foo=bar&region=catnip
is instead coming through as:
http://ravercats.com/meow?foo=bar®ion=catnip
I've verified that this occurs in all browsers. It's my understanding that HTML entity syntax is defined as follows:
&VALUE;
where the leading boundary is the ampersand and the closing boundary is the semicolon. Seems straightforward enough. The problem is that this isn't being respected for the &reg entity, and it's wreaking all kinds of havoc throughout our system.
Does anyone know why this is occurring? Is it a bug in the DTD? (I'm looking for the current HTML DTD to see if I can make sense of it) I'm trying to figure out what would be common across browsers to make this happen, thus my looking for the DTD.
Here is a proof you can use. Take this code, make an HTML file out of it and render it in a browser:
<html>
http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct
</html>
EDIT: To everyone who's suggesting that I need to escape the entire URL, the example URLs above are exactly that, examples. The real URL is coming directly from Google and I have no control over how it is constructed. These suggestions, while valid, don't answer the question: "Why is this happening".
Although valid character references always have a semicolon at the end, some invalid named character references without a semicolon are, for backward compatibility reasons, recognized by modern browsers' HTML parsers.
Either you know what that entire list is, or you follow the HTML5 rules for when & is valid without being escaped (e.g. when followed by a space), or otherwise always escape & as &amp; whenever in doubt.
For reference, the full list of named character references that are recognized without a semicolon is:
AElig, AMP, Aacute, Acirc, Agrave, Aring, Atilde, Auml, COPY, Ccedil,
ETH, Eacute, Ecirc, Egrave, Euml, GT, Iacute, Icirc, Igrave, Iuml, LT,
Ntilde, Oacute, Ocirc, Ograve, Oslash, Otilde, Ouml, QUOT, REG, THORN,
Uacute, Ucirc, Ugrave, Uuml, Yacute, aacute, acirc, acute, aelig,
agrave, amp, aring, atilde, auml, brvbar, ccedil, cedil, cent, copy,
curren, deg, divide, eacute, ecirc, egrave, eth, euml, frac12, frac14,
frac34, gt, iacute, icirc, iexcl, igrave, iquest, iuml, laquo, lt,
macr, micro, middot, nbsp, not, ntilde, oacute, ocirc, ograve, ordf,
ordm, oslash, otilde, ouml, para, plusmn, pound, quot, raquo, reg,
sect, shy, sup1, sup2, sup3, szlig, thorn, times, uacute, ucirc,
ugrave, uml, uuml, yacute, yen, yuml
Note, however, that in an attribute value (and only there), conforming HTML5 parsers do not process the named character references in the above list as such if the next character is a = or an alphanumeric ASCII character.
For the full list of named character references with or without ending semicolons, see here.
This is a very messy business and depends on context (text content vs. attribute value).
Formally, by HTML specs up to and including HTML 4.01, an entity reference may appear without trailing semicolon, if the next character is not a name character. So e.g. &region= would be syntactically correct but undefined, as entity region has not been defined. XHTML makes the trailing semicolon required.
Browsers have traditionally played by other rules, though. Due to the common syntax of query URLs, they parse e.g. href="http://ravercats.com/meow?foo=bar&region=catnip" so that &region is not treated as an entity reference but as just text data. And authors mostly used such constructs, even though they are formally incorrect.
Contrary to what the question seems to be saying, href="http://ravercats.com/meow?foo=bar&region=catnip" actually works well. Problems arise when the string is not in an attribute value but inside text content, which is rather uncommon: we don’t normally write URLs in text. In text, &region= gets processed so that &reg is recognized as an entity reference (for “®”) and the rest is just character data. Such odd behavior is being made official in HTML5 CR, where clause 8.2.4.69 Tokenizing character references describes the “double standard”:
If the character reference is being consumed as part of an attribute,
and the last character matched is not a ";" (U+003B) character, and
the next character is either a "=" (U+003D) character or in the range
ASCII digits, uppercase ASCII letters, or lowercase ASCII letters,
then, for historical reasons, all the characters that were matched
after the U+0026 AMPERSAND character (&) must be unconsumed, and
nothing is returned.
Thus, in an attribute value, even &reg= would not be treated as containing a character reference, and still less &region=. (But reg_test= is a different case, due to the underscore character.)
In text content, other rules apply. The construct &region= causes then a parse error (by HTML5 CR rules), but with well-defined error handling: &reg is recognized as a character reference.
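The text-content behaviour described above can be reproduced outside a browser. As a small sketch, Python's html.unescape is documented to follow the HTML5 rules for both valid and invalid character references, so it treats the test URL from the question the way a parser treats text content. It does not model the attribute-value exception, so take it as an illustration of the text case only.

```python
import html

url = (
    "http://foo.com/bar?foo=bar&region=US&register=lowpass"
    "&reg_test=fail&trademark=correct"
)

# Semicolon-less legacy names such as "reg" are matched by longest prefix,
# so "&region" turns into the registered sign followed by "ion".
as_text_content = html.unescape(url)
print(as_text_content)

# "trademark" has no such legacy prefix, so that ampersand survives.
print("&trademark=correct" in as_text_content)  # True
```

This matches what the question's proof page shows: region, register and reg_test are all mangled, while trademark comes through intact.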
Maybe try replacing your & with &amp;? Ampersands are characters that must be escaped in HTML as well, because they are reserved to be used as parts of entities.
1: The following markup is invalid in the first place (use the W3C Markup Validation Service to verify):
In the above example, the & character should be encoded as &amp;, like so:
2: Browsers are tolerant; they try to make sense out of broken HTML. In your case, everything that could possibly be an HTML entity is converted to the corresponding character.
Here is a simple solution and it may not work in all instances.
So from this:
http://ravercats.com/meow?status=Online&region=Atlantis
To This:
http://ravercats.com/meow?region=Atlantis&status=Online
Because the &reg as we know triggers the special character ®
Caveat: If you have no control over the order of your URL query string parameters then you'll have to change your variable name to something else.
Escape your output!
Simply enough, you need to encode the url format into html format for accurate representation (ideally you would do so with a template engine variable escaping function, but barring that, with htmlspecialchars($url) or htmlentities($url) in php).
See your test case and then the correctly encoded html at this jsfiddle:
http://jsfiddle.net/tchalvakspam/Fp3W6/
Inactive code here:
<div>
Unescaped:
<br>
http://foo.com/bar?foo=bar&region=US&register=lowpass&reg_test=fail&trademark=correct
</div>
<div>
Correctly escaped:
<br>
http://foo.com/bar?foo=bar&amp;region=US&amp;register=lowpass&amp;reg_test=fail&amp;trademark=correct
</div>
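The answer above names PHP's htmlspecialchars(); as a comparable sketch in Python (an assumption, any language's HTML-escaping helper would do), html.escape performs the same step. The round trip at the end checks that a parser reading the escaped text gets the original URL back.

```python
import html

url = (
    "http://foo.com/bar?foo=bar&region=US&register=lowpass"
    "&reg_test=fail&trademark=correct"
)

# Escape at output time: every & becomes &amp; (and <, >, quotes are escaped too).
escaped = html.escape(url)
print(escaped)

# Parsing the escaped form yields the original URL, with no stray (R) signs.
round_tripped = html.unescape(escaped)
print(round_tripped == url)  # True
```

The URL itself is stored and logged unchanged; only the copy that gets written into an HTML page is escaped.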
It seems to me that what you have received from Google is not an actual URL but a variable which refers to a URL (a query string). That's why it's being parsed as a registration mark when rendered.
I would say you ought to URL-encode it and decode it whenever you process it, like any other variable containing special entities.
To prevent this from happening you should encode urls, which replaces characters like the ampersand with a % and a hexadecimal number behind it in the url.

HTML - When using UTF-8 or ISO-8859-1 do I still need to type the codes for the special characters?

It is as the title says:
HTML - When using UTF-8 or ISO-8859-1 do I still need to type the codes for the special characters?
Or can I just type them normally?
Ex: I'm using UTF-8 in my HTML META tag. I need to type ç. Should I just type it, or type its code, which is &ccedil;?
I know this is a trivial question, but it's fundamental so I just can't skip it.
No, you only need to use a character reference if:
The character you want cannot be represented in the character encoding you are using or
The character has some special meaning in HTML (such as < or &).
Note that declaring you are using UTF-8 in the meta tag is insufficient. You also have to encode the HTML source in UTF-8 (good editors will default to this) and not override it with a declaration of some other encoding in the real HTTP headers. You should also set the real HTTP headers to state that UTF-8 is being used.
Yes, you can include those characters directly in your HTML source, without using the entity for the character. Just make sure that the encoding you are saving the file in really does match what the web server serves it in.
The part about ensuring that the encoding is correct is important, and easy to get wrong. One thing to note is that the meta tag is not the primary source of information that the browser uses for interpreting the encoding of the document. The primary source of information is the Content-type header, sent as part of the HTTP headers. The meta tag was originally supposed to be used to communicate to the web server what Content-type to use, but most web servers use configuration separate from the document itself for this. So if you are saving your document as UTF-8, make sure that the web server is configured to serve pages as UTF-8 as well.
The meta tag is used by browsers as a fallback if the Content-type header is not provided or does not include valid encoding information. It is useful to have if you are ever going to be loading from a source that doesn't provide Content-type information, like using a file: URL to view the page on your local machine.
So, there are 3 places you should make sure your encoding is set up properly; in your text editor (so that it saves the file with the appropriate encoding), in your web server configuration (so that it communicates the appropriate encoding to the browser), and in the meta tag, so that when you view the page locally, it is displayed with the correct encoding.
Finally, you shouldn't use ISO-8859-1. That's a legacy encoding, only still supported for compatibility. Every major browser and text editor supports UTF-8 by now, which covers all of Unicode, and provides a lot fewer encoding headaches.
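To show the three places lining up, here is a minimal sketch of a page served from Python's standard WSGI interface. The app function and the page content are made up for the example and are not tied to any particular web server configuration. The body is encoded as UTF-8, the HTTP header says UTF-8, and the meta tag repeats it for the local-file case.

```python
PAGE = (
    "<!DOCTYPE html>\n"
    "<html><head>\n"
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
    "<title>Demo</title>\n"
    "</head><body><p>\u00e7 \u00ae \u2714</p></body></html>\n"
)


def app(environ, start_response):
    """Hypothetical WSGI app: header, meta tag and bytes all agree on UTF-8."""
    body = PAGE.encode("utf-8")  # the bytes actually sent
    start_response(
        "200 OK",
        [
            ("Content-Type", "text/html; charset=utf-8"),  # the real HTTP header
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]


# To try it locally (not run here):
#   from wsgiref.simple_server import make_server
#   make_server("", 8000, app).serve_forever()
```

If any one of the three disagrees with the others, the literal ç in the source is the first thing to break, which is why typing characters directly only works once the encoding chain is consistent.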

Displaying unicode symbols in HTML

I want to simply display the tick (✔) and cross (✘) symbols in an HTML page, but they show up as either a box or goop like âœ” - obviously something to do with the encoding.
I have set the meta tag to show utf-8 but obviously I'm missing something.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Edit/Solution: From comments made, using FireBug I found the headers being passed by my page were in fact "Content-Type: text/html" and not UTF-8. Looking at the file format using Notepad++ showed my file was formatted as "UTF-8 without BOM". Changing this to just UTF-8 the symbols now show correctly... but firebug still seems to indicate the same content-type.
You should ensure the HTTP server headers are correct.
In particular, the header:
Content-Type: text/html; charset=utf-8
should be present.
The meta tag is ignored by browsers if the HTTP header is present.
Also ensure that your file is actually encoded as UTF-8 before serving it. Check/try the following:
Ensure your editor saves it as UTF-8.
Ensure your FTP or any other file transfer program does not mess with the file.
Try with HTML encoded entities, like &#uuu;.
To be really sure, hexdump the file and look at the character; for the ✔, it should be E2 9C 94.
Note: If you use a Unicode character for which your system can't find a glyph (no font with that character), your browser should display a question mark or some block-like symbol. But if you see multiple roman characters like you do, this denotes an encoding problem.
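If no hexdump tool is at hand, the byte check can be sketched in a few lines of Python (an assumption, any language that exposes raw bytes works). It prints the UTF-8 bytes of the tick and then shows what those same bytes turn into when they are misread as windows-1252, which is the "multiple roman characters" symptom.

```python
tick = "\u2714"  # heavy check mark

utf8_bytes = tick.encode("utf-8")
hex_dump = " ".join(f"{b:02X}" for b in utf8_bytes)
print(hex_dump)  # E2 9C 94

# The same three bytes interpreted as windows-1252 give three characters.
mojibake = utf8_bytes.decode("cp1252")
print(mojibake)  # âœ”
```

To check a real file, read it in binary mode and look for the E2 9C 94 sequence at the position of the symbol.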
I know an answer has already been accepted, but wanted to point a few things out.
Setting the content-type and charset is obviously a good practice, doing it on the server is much better, because it ensures consistency across your application.
However, I would use UTF-8 only when the language of my application uses a lot of characters that are available only in the UTF-8 charset. If you want to show a Unicode character or symbol in one-off cases, you can do so without changing the charset of your page.
HTML renderers have always been able to display symbols which are not part of the encoding character set of the page, as long as you write the symbol as a numeric character reference (NCR). Sounds weird, but it's true.
So, even if your HTML has a header that states it has an encoding of ANSI or any of the ISO charsets, you can display a check mark by using its HTML character reference, in decimal - &#10003; or in hex - &#x2713;
So it's a little difficult to understand why you are facing this issue on your pages. Can you check if the NCR value is correct? This is a good reference: http://www.fileformat.info/info/unicode/char/2713/index.htm
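The decimal and hex forms of an NCR are just the code point written in two bases, which can be checked with a short Python sketch (Python is an assumption here; the variable names are made up). It also shows that a page body containing the reference is pure ASCII, so it survives whatever charset the page is served in.

```python
import html

check = "\u2713"  # check mark, U+2713

decimal_ref = f"&#{ord(check)};"   # &#10003;
hex_ref = f"&#x{ord(check):X};"    # &#x2713;
print(decimal_ref, hex_ref)

# Both references decode back to the same character.
print(html.unescape(decimal_ref) == html.unescape(hex_ref) == check)  # True

# The markup itself is plain ASCII, so the page charset cannot damage it.
fragment = f"<p>Done {decimal_ref}</p>".encode("ascii")
print(fragment)
```

Whether the symbol then actually shows up still depends on a font with that glyph being available.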
Make sure that you actually save the file as UTF-8, alternatively use HTML entities (&#nnn;) for the special characters.
Contrary to what Nicolas suggested, the meta tag isn't actually ignored by browsers. However, the Content-Type HTTP header always has precedence over the presence of a meta tag in the document.
So make sure that you either send the correct encoding via the HTTP header, or don’t send this HTTP header at all (not recommended). The meta tag is mainly a fallback option for local documents which aren’t sent via HTTP traffic.
Using HTML entities should also be considered a workaround – that’s tiptoeing around the real problem. Configuring the web server properly prevents a lot of nuisance.
I think this is a file problem: you simply saved your file in a 1-byte encoding like latin-1. Look up your editor and how to set files to utf-8.
I wonder why there are editors that don't default to utf-8.