If I am reading an XML or HTML file, don't I have to read the tag that tells me the encoding to be able to read the file? Isn't that tag encoded the same way the file is? I am curious how you read that tag without knowing the encoding. I realize this is a solved problem. I am just curious how it's done.
Update 1
I don't get it. In UTF-16, won't each character take 2 bytes, not one, and be different from ASCII? For example, the character E in UTF-16 (U+0045) is 0xfeff0045. That is 0xfeff then 0x0045, but some encodings change the endianness of that. Do you have to figure it out by checking for 0xfeff and realizing that it can't be ASCII or something?
Here's what W3C has to say about it:
The XML encoding declaration functions
as an internal label on each entity,
indicating which character encoding is
in use. Before an XML processor can
read the internal label, however, it
apparently has to know what character
encoding is in use--which is what the
internal label is trying to indicate.
In the general case, this is a
hopeless situation. It is not entirely
hopeless in XML, however, because XML
limits the general case in two ways:
each implementation is assumed to
support only a finite set of character
encodings, and the XML encoding
declaration is restricted in position
and content in order to make it
feasible to autodetect the character
encoding in use in each entity in
normal cases.
http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing
The encoding name is limited to ([A-Za-z0-9._] |'-'), so it's identical for any encoding based on ASCII or ISO-646 (e.g. ISO 8859-*, ISO 10646/Unicode).
Edit: There are still some ambiguities though. For example, you still need to have some idea of whether to attempt to read 8-, 16-, or 32-bit chunks at a time to read it. A BOM helps here: the XML spec requires UTF-16 entities to start with one, and its autodetection appendix also covers input without a BOM by looking at how the bytes of the leading "<?xml" are laid out.
If, however, you know the file is supposed to contain XML, you have a pretty good idea of how the file needs to start, so an incorrect guess is easy to detect.
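To make the guessing concrete, here is a minimal Python sketch loosely modelled on the autodetection appendix of the XML spec. The function name, the 200-byte window and the signature table are my own assumptions, not anything the spec prescribes, and it ignores EBCDIC and other exotic cases.

```python
import re

# Leading-byte signatures. Longer ones come first so that a UTF-32LE BOM
# is not mistaken for a UTF-16LE BOM.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),  # BOM
    (b"\xff\xfe\x00\x00", "utf-32-le"),  # BOM
    (b"\xef\xbb\xbf", "utf-8"),          # BOM
    (b"\xfe\xff", "utf-16-be"),          # BOM
    (b"\xff\xfe", "utf-16-le"),          # BOM
    (b"\x00\x00\x00<", "utf-32-be"),     # "<" with no BOM
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),        # "<?" with no BOM
    (b"<\x00?\x00", "utf-16-le"),
]

# The declaration uses ASCII characters only, so in any ASCII-compatible
# encoding it can be matched directly on the raw bytes.
DECLARATION = re.compile(
    rb"<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)

def guess_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            if name != "utf-8":
                return name  # byte width and byte order are already settled
            break
    match = DECLARATION.search(data[:200])
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"  # XML's default when nothing else is declared
```

So the answer to the UTF-16 question is yes: the 0xFEFF mark (or the zero bytes around "<" and "?") is checked first, and only then is the declaration read.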
For HTML, it is documented in HTML5. (Don't read if you still believe anything is sane on the web, though.)
Related
JSON's official specification says:
JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and...
So, essentially the JSON message can come in any of those three encodings. But... how do I guess which one it is when I receive it?
The message can come from multiple sources, such as a queue, from the browser, from the database, the file system, etc.
It also says that parsers may ignore a Byte Order Mark (BOM):
...implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
I remember XML docs had a "prolog" that specified the encoding, but I can't find anything similar for JSON messages.
Any ideas?
rsp and CouchDeveloper have covered this pretty well with their answers (I can't take credit for those).
Both answers look at the byte patterns to determine what encoding has been used. Apologies this doesn't directly answer your question, but it may help you to write an implementation of your own.
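For what it's worth, the byte-pattern approach can be sketched in a few lines of Python. It relies on the observation in the older JSON RFC (4627) that a JSON text starts with two ASCII characters, so the positions of the zero bytes in the first four octets give the encoding away. The function name is made up, and the sketch assumes the text really does start with at least two ASCII characters.

```python
import json

def guess_json_encoding(data: bytes) -> str:
    """Guess UTF-8/16/32 from a BOM or from the pattern of zero bytes."""
    # A BOM, if present, settles it (check UTF-32 before UTF-16).
    if data.startswith((b"\x00\x00\xfe\xff", b"\xff\xfe\x00\x00")):
        return "utf-32"      # this codec reads and strips the BOM itself
    if data.startswith((b"\xfe\xff", b"\xff\xfe")):
        return "utf-16"
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    if len(data) >= 4:
        if data[:3] == b"\x00\x00\x00":       # 00 00 00 xx
            return "utf-32-be"
        if data[1:4] == b"\x00\x00\x00":      # xx 00 00 00
            return "utf-32-le"
        if data[0] == 0 and data[2] == 0:     # 00 xx 00 xx
            return "utf-16-be"
        if data[1] == 0 and data[3] == 0:     # xx 00 xx 00
            return "utf-16-le"
    return "utf-8"                            # xx xx xx xx

raw = '{"a": 1}'.encode("utf-16-le")
decoded = json.loads(raw.decode(guess_json_encoding(raw)))
```

In Python specifically you may not need this at all, since json.loads accepts bytes in UTF-8, UTF-16 or UTF-32 directly, but the sketch shows what such detection boils down to.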
As per my understanding, whoever is the producer/sender of this JSON data must specify the type of encoding used instead of the receiver trying to guess it. Usually this information is a part of API documentation that the producer/sender provides to the receiver.
I see that some sources, like The Unicode Book and some Wikipedia articles, say that Unicode is the default character set of HTML and XML.
I understand the words "character set" to mean the "repertoire" that you can work with when you are making a file. That leads some editors to set their own default character sets regardless of what kind of file is being edited. Even if you are making an HTML file, some editors don't set Unicode as the default.
That leaves the question of whether Unicode is the default character set of HTML and XML, or whether it depends on the editor used to create the file...
I suppose that you could call Unicode "the default" because both HTML and XML define their allowed content in terms of Unicode.
However, a file can't be "in Unicode," it has to be in some encoding of Unicode. By default, XML files are required to be in either UTF-8 or UTF-16 encoding, unless the prologue specifies differently. The HTML spec explicitly leaves the supported encodings undefined, and indicates that the encoding is handled by the transport protocol (eg, HTTP).
Depends on the person editing the document, not so much on the editor. The editor uses the encoding best suited to the author (or what they believe to be best suited to the author) as the default.
Basically, if you don't specify an encoding, or if the client software does not recognize the headers that the server sends, it might (or should) default to a Unicode encoding such as UTF-8. I don't think any of this is mandatory; it just became commonplace behavior.
If I read your question correctly, you need to make a distinction between
the character set you have used
the character set you have declared
The character set you have actually used when you created the document is the one you have set in your editor. Now you need to make sure that consumers of your file will read it correctly, ie that the character set you have used is also the one you declare.
If you don't use a declaration, the default will be UTF-8 for XML documents, as you have said. That's what an application which reads your file will assume. So you better make sure your editor is set to UTF-8, or else use the appropriate XML header, e.g.
<?xml version="1.0" encoding="ISO-8859-1"?>
For HTML documents, the default encoding is usually set in the server config, so check that out. UTF-8 is the most common choice these days.
It's important to differentiate between the set of characters that may appear in an HTML document (which is a rather abstract concept), and the character encoding that is used to store/transfer the HTML file.
The default for the latter depends on OS/Browser/HTML editor settings, and it's definitely not Unicode, because Unicode is not an encoding. It may be "UTF-8", which is a character encoding for Unicode - just like e.g. "UTF-16" (these encodings are different than e.g. "ISO-8859-1", which cannot encode all Unicode characters).
Overall, it's important, that you set your editor to the same encoding which you declare in your HTML file. Some editors do this automatically, but many do not.
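A small Python sketch of what goes wrong when the two disagree. The sample word is arbitrary, and the variable names are just for illustration.

```python
text = "Fürth"

# Case 1: the editor saved ISO-8859-1, but the reader assumes UTF-8.
saved_latin1 = text.encode("iso-8859-1")
try:
    saved_latin1.decode("utf-8")
    latin1_read_as_utf8_ok = True
except UnicodeDecodeError:
    latin1_read_as_utf8_ok = False  # a lone 0xFC byte is not valid UTF-8

# Case 2: the editor saved UTF-8, but the reader assumes ISO-8859-1.
# Nothing fails, but "ü" shows up as two wrong characters.
mojibake = text.encode("utf-8").decode("iso-8859-1")

# ISO-8859-1 cannot represent every Unicode character in the first place.
try:
    "ア".encode("iso-8859-1")
    katakana_fits = True
except UnicodeEncodeError:
    katakana_fits = False
```

Case 2 is the nastier one in practice, because the reader gets garbage instead of an error.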
We are currently working on an I18N project. I am wondering what the complications are of having non-ASCII characters in the URL. If it's not advisable, what are the alternatives for dealing with this problem?
EDIT (in response to Maxym's answer):
The site is going to be local to a specific country, and I need not worry about the worldwide public accessing it. I understand that from a usability point of view it is really annoying. What other technical problems are associated with this?
It is possible to use non-ASCII/non-Latin domain names using IDNA. Further, you can always use percent encoding (like %20 for space) in URLs. RFC 3986 recommends UTF-8 encoding combined with percents:
the data should first be encoded as
octets according to the UTF-8
character encoding; then only those
octets that do not correspond to
characters in the unreserved set
should be percent-encoded. (...) For
example, the character A would be
represented as "A", the character
LATIN CAPITAL LETTER A WITH GRAVE
would be represented as "%C3%80", and
the character KATAKANA LETTER A would
be represented as "%E3%82%A2".
Modern clients (web browsers) are able to transform back and forth between percent encoding and Unicode, so the URL is transferred as ASCII but looks pretty for the user.
Make sure you're using a web framework/CMS that understands this encoding as well, to simplify URL input from webmasters/content editors.
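As a rough illustration of both mechanisms, here is a Python sketch using only the standard library. The helper name to_ascii_url is made up, and note that Python's built-in "idna" codec implements the older IDNA 2003 rules, so treat the host part as an approximation of what a current browser does.

```python
from urllib.parse import quote, unquote

def to_ascii_url(host: str, path: str) -> str:
    """Build the ASCII-only form of a URL with non-ASCII host or path."""
    ascii_host = host.encode("idna").decode("ascii")  # IDNA for the domain
    ascii_path = quote(path, safe="/")                # UTF-8, then %XX
    return f"http://{ascii_host}{ascii_path}"

# The three examples from the RFC quote above:
unreserved = quote("A")   # stays "A"
a_grave = quote("À")      # "%C3%80"
katakana_a = quote("ア")  # "%E3%82%A2"

wire_form = to_ascii_url("de.wikipedia.org", "/wiki/Fürth")
pretty_path = unquote("/wiki/F%C3%BCrth")  # what the user gets to see
```

The round trip through quote and unquote is what lets the address bar show the readable form while only ASCII goes over the wire.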
I would say no. The reason is simple: if you rely on the worldwide public, it would be a big problem for people to type your URL. I live in the "Cyrillic" world, where it is possible to create Cyrillic URLs, but no one has succeeded with that, because even we are too lazy to switch languages and have got used to typing Latin...
Update:
I can't say much about alternatives, but some languages have informal or formal letter substitutes, e.g. in German you can write Ö but in a URL you may see OE instead. You can also consider English words, or words with similar sounds (so people from your country can remember the spelling, and people from other "countries" are not put off).
It depends on the target users. For example, Nürnberg.de also answers at nuernberg.de to stay easily accessible, even though native German users have the German keyboard as their default, with all four extra characters (ö, ä, ü, ß) available. Do not forget that one of the goals of I18N is to give the end user a native-language feel. Mac and Linux users have even more intuitive ways of entering such characters: on a Mac, for instance, pressing Alt+u puts an umlaut on the next character.
I was just wondering what are the
complications of having the non-ascii
characters in the URL.
But the way you have phrased it, your question seems to be more about URIs than URLs: you are trying to put a URN with non-ASCII characters inside a URI. There are no complications in that if you know where and how to parse the URN on the server (for example, on a Django-based server it can be parsed and handled with regexes in urls.py). All you need to keep in mind is that with the Web 2.0 (Ajax, JavaScript-based) evolution nearly everything runs in UTF-8; JavaScript's own URI functions such as encodeURIComponent always encode to UTF-8, for instance. UTF-8 has thus evolved into a de facto standard. Stick with UTF-8 and you will hardly face any complications in parsing and handling URIs.
For example, check the URIs http://de.wikipedia.org/wiki/Fürth or http://hi.wikipedia.org/wiki/जर्मनी. Irrespective of the encoding in which you type them into the address bar, the browser will translate them to UTF-8 and send that to the server.
NOTE: besides UTF-8, some symbols are encoded using percent-encoding. More about it can be found here:
http://en.wikipedia.org/wiki/Percent-encoding
You can use non-ASCII characters in a URL, but it's ugly because special characters must be encoded like this:
http://www.w3schools.com/tags/ref_urlencode.asp
When I fetch data from a feed I store it in a table. The problem is the format of the quote: it stores ’ instead of ' (I hope you can see the difference).
You get the same thing when you copy paste code from a website or word document in your editor.
The problem is that when I display the content on my site I get the following. How do I get rid of that?
The problem relates to character sets. You need to find out what the character set of the feed is (how it's encoded) and also how your site is encoded too.
If the feed will never contain HTML markup then you can use htmlentities() otherwise you'll need to do conversion of the feed at input so that it matches up with the same charset as your site.
MySQL has good internationalization support too and would be able to perform this conversion.
Without knowing the specifics of your site it's hard to advise further.
Echo the text like this on your page:
echo htmlentities($your_text_here);
James C already has the correct answer.
If your site is ISO-8859-1 encoded and you are using the results of a UTF-8 encoded feed, then a
utf8_decode($text);
would be a quick trick to make it work.
In the long run, it would be good to switch to UTF-8 altogether.
If you're outputting data from your database, you need to check the encoding of your
database tables
the MySQL connection
your page encoding
For more sophisticated character set conversion, there is iconv().
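To see why the curly quote turns into garbage, here is a Python sketch of the mismatch. It assumes the feed is UTF-8 and the page is being read as Windows-1252, which is a common combination but only a guess about this particular site; the function names are made up.

```python
def seen_through_cp1252(text: str) -> str:
    """What UTF-8 text looks like when its bytes are read as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")

def repair(garbled: str) -> str:
    """Undo the damage, provided every byte survived the wrong decoding."""
    return garbled.encode("cp1252").decode("utf-8")

original = "It\u2019s"                    # contains a right single quote
garbled = seen_through_cp1252(original)   # the quote becomes three characters
restored = repair(garbled)
```

The repair step is a stopgap: it fails for characters whose UTF-8 bytes have no Windows-1252 meaning, so fixing the declared encodings is the real cure.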
Excellent basic reading on the issue is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
I've got an application where users input text into forms.
The data is saved into a MySQL database (collation: utf8_general_ci) and then output as XML (encoding: UTF-8).
The problem is that people tend to cut and paste their information from other sources, for instance Microsoft Word documents or PDFs.
This input text often has characters which are incorrect for the output encoding, things like "smart quotes", which come from a document in Windows-1252 encoding.
This causes problems, obviously, when transforming or otherwise working on the XML because the characters are illegal.
So, how to sanitise the input?
Previously, I've used some fairly brute-force methods, things like the "de-moronize" script which consists of a long list of search-and-replace operations.
Is this still the best way to do it? Is there any other way?
Can I just set the accept-charset attribute on the form and have the browser do it for me?
If so, which browsers will do that and are there likely to be any problems?
Also, how come my database is accepting these characters, which are reserved/control characters in UTF-8?
As you can see, I know enough about encodings to know I have a problem, but I'm now a bit out of my depth...
TIA
This input text often has characters which are incorrect for the output encoding, things like "smart quotes", which come from a document in Windows-1252 encoding
“Smart quotes” (bytes 147 and 148 in cp1252) are perfectly valid Unicode characters, U+201C and U+201D. Your application should be capable of handling them seamlessly; if not, you're doing something wrong and most likely all non-ASCII characters will fail.
Regardless of whether the characters came from someone typing them or someone pasting them in from Word, the browser should be submitting UTF-8-encoded characters to your application, which should be storing the same UTF-8 bytes to the database.
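A quick way to convince yourself of this, sketched in Python (the variable names are just for illustration):

```python
# Bytes 147 and 148 are the Windows-1252 codes for the two smart quotes.
cp1252_bytes = bytes([147, 148])

# Decoding gives ordinary Unicode characters...
smart_quotes = cp1252_bytes.decode("cp1252")

# ...which encode to perfectly legal UTF-8, three bytes each.
utf8_bytes = smart_quotes.encode("utf-8")
```

Nothing about these characters is illegal in UTF-8 or XML; trouble only starts when the raw cp1252 bytes are passed along as if they were already UTF-8.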
If the browser is not submitting in UTF-8, chances are you're failing to set the charset of the HTML page containing the form. This can be done using the:
Content-Type: text/html;charset=utf-8
HTTP header and/or the:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
element in <head>.
Can I just set the accept-charset attribute on the form and have the browser do it for me?
No, accept-charset is basically useless thanks to IE, which misinterprets it to mean “try using this charset if the one on the page can't encode the characters we want”, instead of “always use this charset”. This means if you use accept-charset you can end up with a mixture of encodings submitted at once, with no way to figure out which is which. Nice!
how come my database is accepting these characters, which are reserved/control characters in UTF-8?
In MySQL UTF-8 is just a collation, used for comparison and ordering. It's still storing the data as bytes and doesn't really care if they're not valid UTF-8 sequences.
It's a good idea to decode and check incoming UTF-8 sequences in your app anyway, because “short sequences”, invalid in modern Unicode, can hide a ‘<’ character that will still be recognised by older browsers (at least IE6 pre-SP2, Opera 7).
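A minimal sketch of such a check in Python, whose strict decoder rejects overlong forms. The function name is hypothetical, and returning None on failure is just one possible policy.

```python
from typing import Optional

def decode_utf8_or_reject(raw: bytes) -> Optional[str]:
    """Return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return None

valid = decode_utf8_or_reject(b"caf\xc3\xa9")

# 0xC0 0xBC is an overlong two-byte spelling of "<" (0x3C).
overlong_lt = decode_utf8_or_reject(b"\xc0\xbc")
```

Rejecting the whole input is usually safer than silently dropping or replacing the bad bytes, since replacement can itself produce surprising markup.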
ETA:
So, I entered a string containing byte 146
No, you entered a Unicode character U+2019. The browser deals with Unicode characters, not bytes, right up until the point it has to submit the serialised form to the server. It's then that it decides how to turn the characters into bytes, and if the page is being handled as UTF-8, it will always choose UTF-8.
(If it's not UTF-8, browsers tend to cheat in a non-standards-compliant way: for all characters that can't fit in the encoding, they'll encode them to HTML character references like ‘&#8217;’. This is wrong because you now can't tell the difference between a browser-escaped ‘&’ and a real, user-typed ‘&’, and it's insidiously wrong because if you then echo the reference as unescaped HTML it looks like you're getting it right, when in fact you've just made a big old security hole.)
It went into the database as 146
Really, a ‘\x92’ byte, not ‘\xC2\x92’, ‘\xE2\x80\x99’ or ‘&#8217;’?
it came out when I produced the (UTF-8-encoded) XML, as 146. No complaints from the browser
Then it did not come out as a single 146-byte. A browser will complain when given a bare ‘\x92’ in an XML file. (Not an HTML file, in which invalid UTF-8 sequences come out as a missing-character glyph.)
I suspect it is coming out as a ‘&#146;’ character reference, which is well-formed (though the character U+0092 is part of the C1 control set, so won't render as anything useful). If this is what's happening, your form page is not being picked up as UTF-8 after all, and you're suffering the browser-auto-escaping-submission problem described above.
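The byte forms in question are easy to tell apart once you see where each comes from. A small Python sketch (variable names are just for illustration):

```python
right_quote = "\u2019"   # the character Word inserts for an apostrophe
c1_control = chr(146)    # U+0092, an unrelated C1 control character

as_cp1252 = right_quote.encode("cp1252")     # the single "byte 146"
as_utf8 = right_quote.encode("utf-8")        # proper UTF-8 for the quote
control_utf8 = c1_control.encode("utf-8")    # UTF-8 for U+0092
```

So a database dump showing \xE2\x80\x99 means everything is fine, \xC2\x92 means the cp1252 byte was misread as a code point somewhere, and a bare \x92 means the data is not UTF-8 at all.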
You might try the Perl Encode module. It supports conversion between a number of character sets, including UTF-8 of course. I just checked my install of Perl and it also supports "cp1252", which is just another name for Windows-1252 according to Wikipedia. You can check your own install with the following one-liner:
perl -MEncode -e 'print map {"$_\n"} Encode->encodings(":all");'
"Can I just set the accept-charset attribute on the form and have the browser do it for me?"
Only if you're prepared to trust "the browser" - that might be suitable in some applications, but in general it's leaving yourself wide open to mischief (or worse).
(Also see bobince's warnings about IE...)
Iain