Is it advisable to have non-ASCII characters in the URL?

We are currently working on an I18N project. I am wondering what the complications are of having non-ASCII characters in the URL. If it's not advisable, what are the alternatives for dealing with this problem?
EDIT (in response to Maxym's answer):
The site is going to be local to a specific country, so I need not worry about the worldwide public accessing it. I understand that from a usability point of view it is really annoying. What other technical problems are associated with this?

It is possible to use non-ASCII/non-Latin domain names using IDNA. Further, you can always use percent encoding (like %20 for space) in URLs. RFC 3986 recommends UTF-8 encoding combined with percents:
the data should first be encoded as
octets according to the UTF-8
character encoding; then only those
octets that do not correspond to
characters in the unreserved set
should be percent-encoded. (...) For
example, the character A would be
represented as "A", the character
LATIN CAPITAL LETTER A WITH GRAVE
would be represented as "%C3%80", and
the character KATAKANA LETTER A would
be represented as "%E3%82%A2".
Modern clients (web browsers) are able to transform back and forth between percent encoding and Unicode, so the URL is transferred as ASCII but looks pretty for the user.
Make sure you're using a web framework/CMS that understands this encoding as well, to simplify URL input from webmasters/content editors.
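To make the RFC 3986 recommendation concrete, here is a minimal Python sketch using only the standard library. The helper names encode_path_segment and encode_hostname are made up for illustration, and the hostname is just a sample; treat this as one way to do it, not the only one.

```python
from urllib.parse import quote, unquote


def encode_path_segment(segment: str) -> str:
    """Encode a path segment as UTF-8 octets, then percent-encode
    everything outside the unreserved set (letters, digits, - . _ ~)."""
    return quote(segment, safe="", encoding="utf-8")


def encode_hostname(hostname: str) -> str:
    """Convert a Unicode hostname to its ASCII (IDNA) form using
    Python's built-in 'idna' codec."""
    return hostname.encode("idna").decode("ascii")


if __name__ == "__main__":
    # The three examples quoted from the RFC above.
    print(encode_path_segment("A"))        # A
    print(encode_path_segment("\u00c0"))   # %C3%80
    print(encode_path_segment("\u30a2"))   # %E3%82%A2
    # Decoding reverses the process.
    print(unquote("%E3%82%A2"))
    # Hostnames use IDNA rather than percent-encoding.
    print(encode_hostname("n\u00fcrnberg.de"))
```

The path and the hostname go through different mechanisms, which is why they are separate helpers here.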

I would say no. The reason is simple: if you rely on the worldwide public, it would be a big problem for people to type your URL. I live in the "Cyrillic" world, where it is possible to create Cyrillic URLs, but no one has succeeded with that, because even we are too lazy to switch keyboard layouts and have gotten used to typing Latin characters...
Update:
I can't speak to alternatives in general, but some languages have an informal or formal letter substitution, e.g. in German you can write Ö, but in a URL you may see OE instead. You can also consider English words, or words with similar sounds, so that people from your country can remember the spelling and people from other countries are not put off.

It depends on the target users. For example, Nürnberg.de also answers at nuernberg.de so that it stays easily accessible; a native German user has a German keyboard by default, with all four extra characters (ö, ä, ü, ß) available. Do not forget that one of the goals of I18N is to give the end user a native-language feel. Mac and Linux users have even more intuitive input methods: on a Mac, for instance, pressing Alt+u adds an umlaut to the next character typed.
I was just wondering what are the
complications of having the non-ascii
characters in the URL.
But the way you phrased your question, it seems to be more about URIs than about URLs: you are trying to put non-ASCII characters into the URN part of a URI. There are no complications in that, as long as you know where and how to parse the URN on the server (for example, on a Django-based server it can be parsed and handled with regexes in urls.py). All you need to keep in mind is that with the Web 2.0 (Ajax, JavaScript-based) evolution, almost everything runs in UTF-8, since JavaScript's URI-encoding functions such as encodeURIComponent use UTF-8. UTF-8 has thus evolved into a de facto standard. Stick with UTF-8 and you will hardly face any complications in parsing and handling URIs.
For example, check the URI http://de.wikipedia.org/wiki/Fürth or http://hi.wikipedia.org/wiki/जर्मनी : irrespective of the encoding you type it in, the browser will translate it to UTF-8 and send it to the server.
NOTE: besides UTF-8, some symbols are encoded using percent-encoding. More about it can be found here:
http://en.wikipedia.org/wiki/Percent-encoding

You can use non-ASCII characters in a URL, but it's ugly because special characters must be encoded like this:
http://www.w3schools.com/tags/ref_urlencode.asp

Related

Using HTML Symbol Entities instead of the actual symbol

Is there any particular reason I should use HTML symbol entities instead of the actual symbol (I mean the one which I can just type)? For example, take the symbol /; the HTML entity code for it is &#47;.
Should I use the symbol's code or the symbol itself in my HTML code, and why?
Using an HTML entity reference allows the entity to be represented as intended regardless of the encoding applied to the document. That is the benefit.
Rather than strictly using entities for all non-US-ASCII characters, feel free to use an encoding for your document that supports the document's target language, preferably one also supporting other languages, like UTF-8.
However, please avoid using any system-specific encoding, especially regular Windows encoding. It is often the case that Windows-1252 text is sent to other systems with the wrong label of ISO-8859-1.
In the past there has certainly been less reliable support for numeric HTML entities than for named HTML entities (based on my own first-hand observation), but in theory a numeric HTML entity is still character encoding independent and "safe" because the numeric value refers directly to a code point registered in the UCS (http://en.wikipedia.org/wiki/Universal_Character_Set) and equivalent to its defined character name.
Caveat: the following describes my own experience, and yours may vary.
HTML documents transferred by clients for me to work on with symbols directly embedded are very often corrupted and cannot be recovered. This may be a weakness of U.S. infrastructure or a lack of knowledge on the part of my customers about how to send their documents. The infrastructure and people in a country whose primary language relies on non-ASCII characters would be much more likely to support and understand how to properly transfer their documents with no corruption.
If you are developing your own website and uploading the final copies of your own files to your server, then the risk of corruption is very small.
If you do not have control over your document from the point you edit it to the point that it is served to users, then you run the risk (perhaps not today, but certainly within recent years in the U.S., a likelihood more than mere risk) of having the document improperly converted at some point along the way and being permanently corrupted regardless of what encoding you attempt to view it in.
No.
Entities and character references are useful only if:
The character has special meaning in HTML at the point where you want to use the character (/ never will, it only has special meaning in places where you can't have a / as data anyway).
You can't type the character (e.g. because it doesn't appear on your keyboard).
You can't encode the file as UTF-8 (or in another encoding that includes it … and / appears in ASCII).
Unless you know for a fact that you will always be using the same software and computer system to edit your HTML, you will inevitably run into situations where you cannot edit your own code if you directly use symbols, regardless of what character encoding you specify in your document or with your HTTP headers. Only in a perfect world does the character encoding always properly transfer, and even then neither Macintosh nor Windows truly does it correctly.
If I open up a supposedly "properly" encoded document from either Macintosh or Windows in software that truly supports all available encoding systems, I see a message like this:
-=-J(DOS)**--F1 Top L3 (Text) ----------------------------------------
These default coding systems were tried to encode text
in the buffer:
(iso-2022-7bit-dos (284 . 4194194) (379 . 4194194) (462 . 4194195)
(492 . 4194196) (635 . 4194195) (640 . 4194196) (642 . 4194195) (772
. 4194196) (833 . 4194195) (839 . 4194196) (857 . 4194195))
(utf-8-dos (284 . 4194194) (379 . 4194194) (462 . 4194195) (492
. 4194196) (635 . 4194195) (640 . 4194196) (642 . 4194195) (772
. 4194196) (833 . 4194195) (839 . 4194196) (857 . 4194195))
However, each of them encountered characters it couldn't encode:
iso-2022-7bit-dos cannot encode these: \222 \222 \223 \224 \223 \224 \223 \224 \223 \224 ...
utf-8-dos cannot encode these: \222 \222 \223 \224 \223 \224 \223 \224 \223 \224 ...
Click on a character (or switch to this window by `C-x o'
and select the characters by RET) to jump to the place it appears,
where `C-u C-x =' will give information about it.
Select one of the safe coding systems listed below,
or cancel the writing with C-g and edit the buffer
to remove or modify the problematic characters,
or specify any other coding system (and risk losing
the problematic characters).
thai-tis620
Remember that as soon as the data is off of your server, e.g., placed in an email, etc., there is no guarantee the encoding is passed along, and chances are that it is not. Byte marks and other invisible means of identifying documents do not work as promised, let alone transient methods such as HTTP headers which are lost as soon as the document moves beyond the context of your own carefully configured HTTP server.
The guiding principle of HTML is that it is a plain text markup language that, when properly used, is universally compatible with any system supporting the most basic of text. HTML documents should use HTML entities for any characters outside of the normal 7-bit US-ASCII character set. Any other characters have different binary definitions depending on the encoding used and may even vary between single-byte and multi-byte representations.
Within Non-HTML documents you can feel free to use raw symbols because when you embed them within either their native file format or within HTML you can ensure that you specify the "right" character encoding, i.e., the one that will be recognized by the system where you authored it and any system compatible with that.
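For anyone who does want pure 7-bit ASCII output, as argued in the answer above, the conversion can be automated rather than looked up by hand. This is a minimal Python sketch; the function name to_ascii_html is invented for illustration, and whether you need this at all depends on how far you trust your toolchain to preserve UTF-8.

```python
import html


def to_ascii_html(text: str) -> str:
    """Escape markup-significant characters (&, <, >), then replace every
    non-ASCII character with a decimal numeric character reference."""
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(to_ascii_html("F\u00fcrth & N\u00fcrnberg"))
    # html.unescape turns references back into characters.
    print(html.unescape("F&#252;rth"))
```

The result contains only ASCII bytes, so it reads the same whether a downstream system treats it as ASCII, ISO-8859-1, Windows-1252, or UTF-8.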

How to read the encoding header without knowing the encoding?

If I am reading an XML or HTML file, don't I have to read the tag that tells me the encoding to be able to read the file? Isn't that tag encoded the same way the file is? I am curious how you read that tag without knowing the encoding. I realize this is a solved problem; I am just curious how it's done.
Update 1
I don't get it: in UTF-16, won't each character take 2 bytes, not one, and be different from ASCII? For example, the character E in UTF-16 (U+0045) is 0xfeff0045. That is 0xfeff then 0x0045, but some encodings change the endianness of that. Do you have to figure it out by checking for 0xfeff and realizing that can't be ASCII, or something?
Here's what W3C has to say about it:
The XML encoding declaration functions
as an internal label on each entity,
indicating which character encoding is
in use. Before an XML processor can
read the internal label, however, it
apparently has to know what character
encoding is in use--which is what the
internal label is trying to indicate.
In the general case, this is a
hopeless situation. It is not entirely
hopeless in XML, however, because XML
limits the general case in two ways:
each implementation is assumed to
support only a finite set of character
encodings, and the XML encoding
declaration is restricted in position
and content in order to make it
feasible to autodetect the character
encoding in use in each entity in
normal cases.
http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing
The encoding name is limited to ([A-Za-z0-9._] |'-'), so it's identical for any encoding based on ASCII or ISO-646 (e.g. ISO 8859-*, ISO 10646/Unicode).
Edit: There are still some ambiguities, though. For example, you still need to have some idea of whether to read 8-, 16-, or 32-bit chunks at a time. A proper UTF-16 file should also start with a BOM, and the XML spec accommodates this: entities encoded in UTF-16 must begin with one and UTF-8 entities may, so a parser can check for a BOM before it looks for the declaration.
If, however, you know the file is supposed to contain XML, you have a pretty good idea of how the file needs to start, so an incorrect guess is easy to detect.
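As a rough illustration of that autodetection, here is a simplified Python sketch: check for a BOM, then check how the opening "<?" would look in the 16- and 32-bit families, and only then read the declaration as ASCII-compatible text. The function name sniff_xml_encoding is invented, EBCDIC and the unusual UTF-32 byte orders are left out, and a real parser does considerably more.

```python
import codecs
import re

# Byte-order marks, UTF-32 first so it is not mistaken for UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# How the leading "<" (and "?") looks without a BOM in wider encodings.
_PATTERNS = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
]

# The declaration uses only ASCII characters, so it can be read
# before the real encoding is known.
_DECL = re.compile(
    r'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    for prefix, name in _PATTERNS:
        if data.startswith(prefix):
            return name
    # ASCII-compatible family: latin-1 decoding never fails and keeps
    # the ASCII range intact, which is all the declaration needs.
    head = data[:200].decode("latin-1")
    match = _DECL.match(head)
    return match.group(1).lower() if match else "utf-8"
```

Because the file is known to start with "<?xml" (or at least "<"), a wrong guess about the chunk width shows up immediately as a pattern that matches nothing.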
For HTML, it is documented in HTML5. (Don't read if you still believe anything is sane on the web, though.)

How to deal with HTML-entities for publishing multilingual content

When publishing text online as an HTML page, I face the problem of correctly displaying symbols from several languages that require extended Latin character encoding. At the moment I look up the entity (hex) for each one in the list on this site (http://theorem.ca/~mvcorks/code/charsets/auto.html). I wonder if it's possible to save time by defining meta tags and their attributes instead.
Any advice would be much appreciated.
Thanks.
Vitaly Repin
I recommend that you use the Unicode character set and encode the characters with UTF-8.
Unicode contains probably all characters you’ll need and UTF-8 is the most efficient encoding for the Unicode charset concerning the code word lengths. If you’re using UTF-8, you don’t need the HTML character references as you can use the character they represent themselves.
Just write your text with the plain characters, tell your editor to save it using UTF-8 as character encoding, and tell your web server to serve the document with UTF-8.

Formatting problems when fetching feeds

When I fetch data from a feed I store it in a table. The problem is the format of the quotes: it will store ’ instead of ' (I hope you can see the difference).
You get the same thing when you copy paste code from a website or word document in your editor.
The problem is that when I display the content on my site I get the following. How do I get rid of that?
The problem relates to character sets. You need to find out what the character set of the feed is (how it's encoded) and also how your site is encoded too.
If the feed will never contain HTML markup then you can use htmlentities() otherwise you'll need to do conversion of the feed at input so that it matches up with the same charset as your site.
MySQL has good internationalization support too and would be able to perform this conversion.
Without knowing the specifics of your site it's hard to advise further.
Echo the text like this on your page:
echo htmlentities($your_text_here);
James C already has the correct answer.
If your site is ISO-8859-1 encoded and you are consuming a UTF-8 encoded feed, then a
utf8_decode($text);
would be a quick trick to make it work.
In the long run, it would be good to switch to UTF-8 altogether.
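To see why the quote gets mangled and what a UTF-8 to ISO-8859-1 conversion does to it, here is a small Python sketch. The sample string and the helper name to_latin1 are invented; the replacement behaviour is roughly what the PHP call above gives you, since a curly quote has no ISO-8859-1 equivalent.

```python
# What a UTF-8 feed delivers for a curly apostrophe (U+2019).
feed_bytes = "It\u2019s here".encode("utf-8")

# Reading those bytes with the wrong charset produces the usual garbage:
# the three UTF-8 bytes become three separate Windows-1252 characters.
garbled = feed_bytes.decode("cp1252")

# Reading them with the right charset gives the intended text.
text = feed_bytes.decode("utf-8")


def to_latin1(value: str) -> bytes:
    """Convert text to ISO-8859-1, replacing characters that do not
    exist in that encoding with '?'."""
    return value.encode("iso-8859-1", errors="replace")


if __name__ == "__main__":
    print(garbled)
    print(text)
    print(to_latin1(text))
    print(to_latin1("caf\u00e9"))
```

Accented Latin letters survive the conversion, but typographic quotes do not, which is one more argument for moving the whole site to UTF-8.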
If you're outputting data from your database, you need to check the encoding of your
database tables
the MySQL connection
your page encoding
For more sophisticated character set conversion, there is iconv().
Excellent basic reading on the issue is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

How do I sanitize user input for proper content-encoding before I save it?

I've got an application where users input text into forms.
The data is saved into a MySQL database (collation: utf8_general_ci) and then output as XML (encoding: UTF-8).
The problem is that people tend to cut and paste their information from other sources, for instance Microsoft Word documents or PDFs.
This input text often has characters which are incorrect for the output encoding, things like "smart quotes", which come from a document in Windows-1252 encoding
This causes problems, obviously, when transforming or otherwise working on the XML because the characters are illegal.
So, how to sanitise the input?
Previously, I've used some fairly brute-force methods, things like the "de-moronize" script which consists of a long list of search-and-replace operations.
Is this still the best way to do it? Is there any other way?
Can I just set the accept-charset attribute on the form and have the browser do it for me?
If so, which browsers will do that and are there likely to be any problems?
Also, how come my database is accepting these characters, which are reserved/control characters in UTF-8?
As you can see, I know enough about encodings to know I have a problem, but I'm now a bit out of my depth...
TIA
This input text often has characters which are incorrect for the output encoding, things like "smart quotes", which come from a document in Windows-1252 encoding
“Smart quotes” (bytes 147 and 148 in cp1252) are perfectly valid Unicode characters, U+201C and U+201D. Your application should be capable of handling them seamlessly; if not, you're doing something wrong and most likely all non-ASCII characters will fail.
Regardless of whether the characters came from someone typing them or someone pasting them in from Word, the browser should be submitting UTF-8-encoded characters to your application, which should be storing the same UTF-8 bytes to the database.
If the browser is not submitting in UTF-8, chances are you're failing to set the charset of the HTML page containing the form. This can be done using the:
Content-Type: text/html;charset=utf-8
HTTP header and/or the:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
element in <head>.
Can I just set the accept-charset attribute on the form and have the browser do it for me?
No, accept-charset is basically useless thanks to IE, which misinterprets it to mean “try using this charset if the one on the page can't encode the characters we want”, instead of “always use this charset”. This means if you use accept-charset you can end up with a mixture of encodings submitted at once, with no way to figure out which is which. Nice!
how come my database is accepting these characters, which are reserved/control characters in UTF-8?
In MySQL UTF-8 is just a collation, used for comparison and ordering. It's still storing the data as bytes and doesn't really care if they're not valid UTF-8 sequences.
It's a good idea to decode and check incoming UTF-8 sequences in your app anyway, because “short sequences”, invalid in modern Unicode, can hide a ‘<’ character that will still be recognised by older browsers (at least IE6 pre-SP2, Opera 7).
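A minimal sketch of that check in Python, assuming the raw request bytes are available before any decoding happens. The function name decode_utf8_strict is invented; Python's strict UTF-8 decoder rejects overlong ("short") sequences as well as stray Windows-1252 bytes.

```python
def decode_utf8_strict(raw: bytes) -> str:
    """Return the decoded text, or raise ValueError if raw is not
    well-formed UTF-8 (including overlong sequences)."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"invalid UTF-8 at byte offset {exc.start}") from exc


if __name__ == "__main__":
    # Smart quotes submitted as UTF-8 are perfectly fine.
    print(decode_utf8_strict("\u201cquoted\u201d".encode("utf-8")))

    # b"\xc0\xbc" is an overlong two-byte encoding of '<';
    # b"\x93" is a bare Windows-1252 smart quote.
    for bad in (b"\xc0\xbc", b"\x93"):
        try:
            decode_utf8_strict(bad)
        except ValueError as err:
            print("rejected:", err)
```

Rejecting the request outright is only one policy; logging it and replacing the bad bytes is another, depending on how strict the application needs to be.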
ETA:
So, I entered a string containing byte 146
No, you entered a Unicode character U+2019. The browser deals with Unicode characters, not bytes, right up until the point it has to submit the serialised form to the server. It's then that it decides how to turn the characters into bytes, and if the page is being handled as UTF-8, it will always choose UTF-8.
(If it's not UTF-8, browsers tend to cheat in a non-standards-compliant way: for all characters that can't fit in the encoding, it'll encode them to HTML character references like ‘&#8217;’. This is wrong because you now can't tell the difference between a browser-escaped ‘&’ and a real, user-typed ‘&’, and it's insidiously wrong because if you then echo the reference as unescaped HTML it looks like you're getting it right, when in fact you've just made a big old security hole.)
It went into the database as 146
Really, a ‘\x92’ byte, not ‘\xC2\x92’, ‘\xE2\x80\x99’ or ‘&#146;’?
it came out when I produced the (UTF-8-encoded) XML, as 146. No complaints from the browser
Then it did not come out as a single 146-byte. A browser will complain when given a bare ‘\x92’ in an XML file. (Not an HTML file, in which invalid UTF-8 sequences come out as a missing-character glyph.)
I suspect it is coming out as a ‘&#146;’ character reference, which is well-formed (though the character U+0092 is part of the C1 control set, so won't render as anything useful). If this is what's happening, your form page is not being picked up as UTF-8 after all, and you're suffering the browser-auto-escaping-submission problem described above.
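If you want to check which of those forms actually ended up in the database, comparing the stored bytes directly is the quickest way. This Python sketch is only illustrative: describe_quote_bytes is an invented helper, and it assumes you can fetch the raw bytes of the stored value for a single right-hand smart quote.

```python
def describe_quote_bytes(data: bytes) -> str:
    """Classify how a pasted right single quote may have been stored."""
    if data == b"\xe2\x80\x99":
        return "U+2019 correctly encoded as UTF-8"
    if data == b"\xc2\x92":
        return "C1 control U+0092 encoded as UTF-8"
    if data == b"&#146;":
        return "numeric character reference to code point 146"
    if data == b"\x92":
        return "bare Windows-1252 byte, not valid UTF-8"
    return "something else"


if __name__ == "__main__":
    # Byte 146 (0x92) in Windows-1252 is the character U+2019.
    print(b"\x92".decode("cp1252") == "\u2019")
    # What a UTF-8 form submission should produce for that character.
    print(describe_quote_bytes("\u2019".encode("utf-8")))
    # What you get if the byte was misread as ISO-8859-1 and re-encoded.
    print(describe_quote_bytes("\x92".encode("utf-8")))
```

Only the first form means the whole pipeline is handling the character as UTF-8.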
You might try the Perl Encode module. It supports conversion between a number of character sets, including UTF-8 of course. I just checked my install of Perl and it also supports "cp1252", which is just another name for Windows-1252 according to Wikipedia. You can check your own install with the following one-liner:
perl -MEncode -e 'print map {"$_\n"} Encode->encodings(":all");'
"Can I just set the accept-charset attribute on the form and have the browser do it for me?"
Only if you're prepared to trust "the browser" - that might be suitable in some applications, but in general it's leaving yourself wide open to mischief (or worse).
(Also see bobince's warnings about IE...)
Iain