Valid domain name registration for Unicode characters - google-chrome

I'm trying to figure out what is valid for domain name registration; apparently some Unicode characters are translated oddly, while others are not translated at all.
This address:
http://xn--ippleman-dmj.com/
Translates to:
http://Nippleman.com/
and
http://xn--ggle-0nda.com/
should translate to:
http://gοοgle.com/
but for some reason the browser prevents it.
How is the format for these domains determined, and what is or isn't blocked by the browser?
http://xn--ippleman-dmj.com/ is a valid URL, while http://www.gοοgle.com is not. Yet Chrome only replaces the Unicode on the second URL.

It appears that you're trying to do an IDN homograph attack. The Wikipedia page nicely explains what Chrome is doing to stop you.

First, to your question.
A valid domain name must conform to RFC 1035 regardless of browser: the whole domain name must not exceed 255 octets of valid ASCII characters, and it is case-insensitive. Even IDNs must comply with this standard, so to represent them, later RFCs introduced the Punycode 'xn--' conversion.
Then there is the proof of concept of the IDN homograph attack. Unicode.org maintains a list of confusable characters; you can download the current version of TR39 and play around with it.
Previously, Chrome and Firefox would translate a domain name starting with xn-- to the corresponding Unicode, provided the glyphs were found in the browser's font cache. If the browser couldn't find a font, it would display the raw 'xn--' Punycode domain name.
This is a known issue. Firefox even has a manual option to enable/disable Punycode domain name display. Google decided to remove the conversion from version 58 onwards, while Firefox 53 will follow by making the Punycode display the default.
I don't know whether Google will show Unicode characters that are not in TR39, or just remove the Punycode-to-Unicode conversion for everything.
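For reference, here is a minimal sketch of the conversion step itself in PHP, assuming the intl extension is installed; idn_to_ascii()/idn_to_utf8() perform the same 'xn--' mapping that browsers apply before their safety checks. The example domain is the one from the question.
<?php
// Unicode label -> ASCII-compatible 'xn--' form (Punycode/IDNA).
// Assumes PHP's intl extension; returns false on invalid input.
$ascii = idn_to_ascii('gοοgle.com', IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
echo $ascii, "\n";   // xn--ggle-0nda.com
// ...and back again, which is the step browsers now refuse to display
// for confusable domains.
$unicode = idn_to_utf8('xn--ggle-0nda.com', IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
echo $unicode, "\n"; // gοοgle.com (Greek omicrons)
?>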

Related

How can I display ë character in the url?

I have a URL that contains the ë character. Is there any way I can display it as ë on the front end, while on the back end it is converted to the percent-encoded value %C3%AB of this character? When you view this particular question, the page URL displays the ë character. I want the same thing to be displayed. Thanks in advance for any suggestions.
Well, you'd do good to look at the HTML for this page then:
<a href="/questions/38720183/how-can-i-display-%c3%ab-character-in-the-url">
You must use the correctly URL-encoded version, %c3%ab. The browser may then decide to render it as "ë". That's entirely up to the browser, and it won't do it for all characters; specifically, it won't decode certain lookalike characters that can be used to spoof a URL so that it looks identical to another URL but is actually different.
You should use percent-encoding, which is
a mechanism for encoding information in a Uniform Resource Identifier
(URI) under certain circumstances. Although it is known as URL
encoding it is, in fact, used more generally within the main Uniform
Resource Identifier (URI) set, which includes both Uniform Resource
Locator (URL) and Uniform Resource Name (URN). As such, it is also
used in the preparation of data of the
application/x-www-form-urlencoded media type, as is often used in the
submission of HTML form data in HTTP requests.
There is a website, http://www.url-encode-decode.com/, that will do the encoding for you.
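As a minimal PHP sketch of the above (the path and slug here are made up for illustration): rawurlencode() produces exactly the %C3%AB form for "ë", while the visible link text stays human-readable.
<?php
// Emit a link whose href is percent-encoded but whose text shows "ë".
// rawurlencode() encodes per RFC 3986: "ë" (UTF-8 0xC3 0xAB) -> "%C3%AB".
$slug = 'how-can-i-display-ë-character-in-the-url';
$href = '/questions/38720183/' . rawurlencode($slug);
echo '<a href="' . htmlspecialchars($href) . '">' . htmlspecialchars($slug) . '</a>';
?>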

Forward slashes in http protocol declaration in URL

I have just noticed, via HTML form validation for an input of type url, that the double forward slash '//' after the protocol is not required. I tried entering URLs without the forward slashes into many browsers and they all work, e.g. http:www.web-dewd.com works in Chrome, Firefox, Edge, Opera, and dare I say it, even IE11.
I cannot find any specific definition which states whether they are required or not. I spent a good few minutes on https://www.w3.org/standards/ without any luck. The best I could find was an interview with Tim Berners-Lee stating they are not required: http://www.dailymail.co.uk/sciencetech/article-1220286/Sir-Tim-Berners-Lee-admits-forward-slashes-web-address-mistake.html :
But with the colon in there as well, it turns out people never use the slash slash...
This article from ZDNet states:
there is practically no reference to the double forward-slash on the web
I would argue that the slashes are recommended, but does anyone know, and can anyone provide evidence of, what the correct standard is?
Somewhat ironically, Stack Overflow does require // when entering a link, as do other editors when deciding whether to convert text to a link, e.g. Microsoft Outlook.
Source
PrePrefix: To be a Uniform Resource Locator as currently defined by the URI
working group, the whole string must start with a constant prefix
"URL:"
This part says that a valid URL starts with protocol: and does not state anything about //.
Internet protocol parts Those schemes which refer to internet protocols mostly have a
common syntax for the rest of the object name. This starts with a
double slash "//" to indicate its presence, and continues until the
following slash "/".
So a URL string must start with protocol:, and // is just common syntax indicating where the domain name starts.
When matching URLs you usually look for http[s]:// rather than http[s]:. That is just common practice, and does not mean that all web developers will follow it.
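To see that the slashes are not mere decoration, here is a small PHP illustration (not normative, just how one common URL parser behaves): without //, the text after the scheme is parsed as a path rather than a host.
<?php
// With "//" the authority (host) is recognised:
var_dump(parse_url('http://www.web-dewd.com/'));
// => ["scheme" => "http", "host" => "www.web-dewd.com", "path" => "/"]
// Without "//" there is no host at all; everything becomes the path:
var_dump(parse_url('http:www.web-dewd.com'));
// => ["scheme" => "http", "path" => "www.web-dewd.com"]
?>
Browsers paper over the difference; generic libraries generally do not.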

Which charset should I use in a mailto body for maximum compatibility?

I have a UTF-8 page containing mailto links with a body template. The template data includes East European (Czech) characters. These characters mess up some email clients like Outlook 2007, where they are displayed as question marks or other strange characters.
I know about the "Enable UTF-8 support for mailto: protocol" setting in Outlook 2007, but from what I know, it's off by default.
Which charset encoding should I use in the mailto body for maximum compatibility?
It clearly depends on client settings. In the company I work for, we had a similar question; we ended up using ae for ä and so on.
We faced the problem that some clients running MS Outlook 2003 were reading it as UTF-8, while other clients running Outlook 2003 were reading it as a defined ANSI table.
Another issue comes with the use of other mail clients. We also have a bring-your-own-device program where employees can install Linux or have MacBooks; they use Mozilla Thunderbird, mutt and so on, and other clients have set Gmail as their preferred mailto: handler. So a global change of the Outlook settings through a registry edit, as in Rfilip's answer, would not apply to us.
Anyway, you can only define the text body of the mail. Sadly, it's impossible to define an HTML body through a mailto: link; if that worked, the encoding could be defined through the body.
There is also no flag to set the preferred encoding inside the mailto: link.
The answer to this question: it's impossible to detect or set the mail client's encoding, so you should switch to characters that work in every ANSI table.
As an alternative, you could provide two links: one encoded with UTF-8, with a note that if it displays weirdly, the user should try the second mailto link.
Or you could provide one mailto link as well as a drop-down box to select the encoding (including a brief description of what encoding is, shown on hover over the drop-down).
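A rough PHP sketch of the "ae for ä" approach described above (the address and texts are placeholders, and how well //TRANSLIT transliterates depends on the platform's iconv implementation, so treat this as an assumption to verify):
<?php
// Transliterate the body to plain ASCII before percent-encoding it into
// the mailto: link. iconv() may return false or produce "?" for some
// characters depending on the platform's iconv and locale.
$body  = 'Příliš žluťoučký kůň'; // Czech sample text
$ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $body);
$href  = 'mailto:info@example.com?subject=' . rawurlencode('Hello')
       . '&body=' . rawurlencode($ascii);
echo '<a href="' . htmlspecialchars($href) . '">Send mail</a>';
?>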
It depends on the characters you want to use and your clients' setups...
It may be Windows-1250 ( http://en.wikipedia.org/wiki/Windows-1250 ).
So, on G+ I was pointed to an answer in another thread: https://stackoverflow.com/a/1831416/1190066. It seems that different charsets fail on different combinations of browser/OS/email client, and that nothing works everywhere.
So I'll use the fact that it's an intranet app on a Win 7, Chrome, Outlook 2007 setup, and I'll convince the admin to enable "Enable UTF-8 support for mailto: protocol" through a registry key change on the client company computers: https://social.technet.microsoft.com/Forums/en-US/183b2442-6750-4e18-b61d-d87ce5f3aac3/outlook-2007-utf8-mailto-protocol-how-to-set-this-parameter-in-thousand-machines-?forum=officesetupdeploylegacy
HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Internet Settings\Protocols\Mailto
UTF8Encoding
The value "1" turns on "Enable UTF-8 support for mailto: protocol"; the value "0" turns it off.
But this workaround is viable only when you can modify the settings of all clients' computers.
I was looking for possibilities that don't involve that change, but I see that there aren't any.

What characters are allowed in the HTML Name attribute inside input tag?

I have a PHP script that will generate <input>s dynamically, so I was wondering if I needed to filter any characters in the name attribute.
I know that the name has to start with a letter, but I don't know any other rules. I figure square brackets must be allowed, since PHP uses these to create arrays from form data. How about parentheses? Spaces?
Note that not all characters are submitted as-is in the name attributes of form fields (even when using POST)!
Leading and trailing whitespace characters are trimmed, and inner whitespace characters as well as the character . are replaced by _.
(Tested in Chrome 23, Firefox 13 and Internet Explorer 9, all Win7.)
Any character you can include in an [X]HTML file is fine to put in an <input name>. As Allain's comment says, <input name> is defined as containing CDATA, so the only things you can't put in there are the control codes and invalid codepoints that the underlying standard (SGML or XML) disallows.
Allain quoted W3 from the HTML4 spec:
Note. The "get" method restricts form data set values to ASCII characters. Only the "post" method (with enctype="multipart/form-data") is specified to cover the entire ISO10646 character set.
However, this isn't really true in practice.
The theory is that application/x-www-form-urlencoded data doesn't have a mechanism to specify an encoding for the form's names or values, so using non-ASCII characters in either is “not specified” as working and you should use POSTed multipart/form-data instead.
Unfortunately, in the real world, no browser specifies an encoding for fields even when it theoretically could, in the subpart headers of a multipart/form-data POST request body. (I believe Mozilla tried to implement it once, but backed out as it broke servers.)
And no browser implements the astonishingly complex and ugly RFC2231 standard that would be necessary to insert encoded non-ASCII field names into the multipart's subpart headers. In any case, the HTML spec that defines multipart/form-data doesn't directly say that RFC2231 should be used, and, again, it would break servers if you tried.
So the reality of the situation is that there is no way to know what encoding is being used for the names and values in a form submission, no matter what type of form it is. What browsers do with field names and values that contain non-ASCII characters is the same for GET and both types of POST form: they encode them using the encoding of the page containing the form. Non-ASCII GET form names are no more broken than everything else. (A quick way to observe this is sketched at the end of this answer.)
DLH:
So name has a different data type for form fields than it does for other elements?
Actually the only element whose name attribute is not CDATA is <meta>. See the HTML4 spec's attribute list for all the different uses of name; it's an overloaded attribute name, having many different meanings on the different elements. This is generally considered a bad thing.
However, typically these days you would avoid name except on form fields (where it's a control name) and param (where it's a plugin-specific parameter identifier). That's only two meanings to grapple with. The old-school use of name for identifying elements like <form> or <a> on the page should be avoided (use id instead).
The only real restriction on what characters can appear in form control names comes when a form is submitted with GET:
"The "get" method restricts form data set values to ASCII characters." reference
There's a good thread on it here.
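A quick way to observe the behaviour described above (a throwaway PHP sketch, not production code): serve the same form once as UTF-8 and once as ISO-8859-1, submit it, and compare the raw bytes the browser sent for the non-ASCII field name.
<?php
// Toggle the charset below (and save the file in the matching encoding),
// submit the form, and inspect the raw request body: the field name
// "naïve" arrives encoded per the page's charset
// (na%C3%AFve for UTF-8, na%EFve for ISO-8859-1).
header('Content-Type: text/html; charset=UTF-8'); // try ISO-8859-1 too
if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    echo '<pre>' . htmlspecialchars(file_get_contents('php://input')) . '</pre>';
}
?>
<form method="post">
  <input type="text" name="naïve" value="ë">
  <button>Submit</button>
</form>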
While Allain's comment did answer the OP's direct question and bobince provided some brilliant in-depth information, I believe many people come here seeking the answer to a more specific question: "Can I use a dot character in a form's input name attribute?"
As this thread came up as the first result when I searched for this, I guessed I might as well share what I found.
Firstly, Matthias claimed that:
character . are replaced by _
This is untrue. I don't know if browsers actually did this kind of operation back in 2013, though I doubt it. Browsers send dot characters as they are (talking about POST data)! You can check it in the developer tools of any decent browser.
Please notice the tiny comment by abluejelly, which is probably missed by many:
I'd like to note that this is a server-specific thing, not a browser thing. Tested on Win7 FF3/3.5/31, IE5/7/8/9/10/Edge, Chrome39, and Safari Windows 5, and all of them sent " test this.stuff" (four leading spaces) as the name in POST to the ASP.NET dev server bundled with VS2012.
I checked it with Apache HTTP Server (v2.4.25), and indeed an input name like "foo.bar" is changed to "foo_bar". But in a name like "foo[foo.bar]" that dot is not replaced by _!
My conclusion: you can use dots, but I wouldn't, as this may lead to unexpected behaviour depending on the server-side stack used.
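For anyone on a PHP stack, the renaming is easy to reproduce; it is PHP's handling of incoming variables (dots and spaces in top-level names become underscores), not something the browser does. A minimal self-contained sketch:
<form method="post">
  <input name="foo.bar" value="1">
  <input name="foo[foo.bar]" value="2">
  <button>Go</button>
</form>
<?php
// After submitting, PHP has renamed the top-level dot but kept the
// bracketed key intact:
var_dump($_POST);
// => ["foo_bar" => "1", "foo" => ["foo.bar" => "2"]]
?>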
Do you mean the id and name attributes of the HTML input tag?
If so, I'd be very tempted to restrict (or convert) the allowed "input" name characters to only a-z (A-Z), 0-9 and a limited range of punctuation (".", ",", etc.), if only to limit the potential for XSS exploits, etc.
Additionally, why let the user control any aspect of the input tag at all? (Might it not ultimately be easier, from a validation perspective, to keep the input tag names as 'custom_1', 'custom_2', etc. and then map these as required?)
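A minimal PHP sketch of that whitelisting idea (the function name is made up; adjust the allowed set to taste):
<?php
// Strip every character that is not alphanumeric, underscore, dot, or
// hyphen from a generated field name.
function sanitize_field_name(string $name): string {
    return preg_replace('/[^A-Za-z0-9_.-]/', '', $name);
}
echo sanitize_field_name('user name[ë]!'); // "username"
?>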

Unicode characters in URLs

In 2010, would you serve URLs containing UTF-8 characters in a large web portal?
Unicode characters are forbidden as per the RFC on URLs (see here). They would have to be percent encoded to be standards compliant.
My main point, though, is about serving the unencoded characters for the sole purpose of having nice-looking URLs, so percent encoding is out.
All major browsers seem to be parsing those URLs okay no matter what the RFC says. My general impression, though, is that it gets very shaky when leaving the domain of web browsers:
URLs getting copied and pasted into text files, e-mails, even web sites with a different encoding
HTTP Client libraries
Exotic browsers, RSS readers
Is my impression correct that trouble is to be expected here, and thus it's not a practical solution (yet) if you're serving a non-technical audience and it's important that all your links work properly even if quoted and passed on?
Is there some magic way of serving nice-looking URLs in HTML
http://www.example.com/düsseldorf?neighbourhood=Lörick
that can be copy+pasted with the special characters intact, but work correctly when re-used in older clients?
Use percent encoding. Modern browsers will take care of the display and paste issues and make it human-readable, e.g. http://ko.wikipedia.org/wiki/위키백과:대문
Edit: when you copy such a URL in Firefox, the clipboard will hold the percent-encoded form (which is usually a good thing), but if you copy only a part of it, it will remain unencoded.
What Tgr said. Background:
http://www.example.com/düsseldorf?neighbourhood=Lörick
That's not a URI. But it is an IRI.
You can't include an IRI in an HTML4 document; the type of attributes like href is defined as URI and not IRI. Some browsers will handle an IRI here anyway, but it's not really a good idea.
To encode an IRI into a URI, take the path and query parts, UTF-8-encode them, then percent-encode the non-ASCII bytes:
http://www.example.com/d%C3%BCsseldorf?neighbourhood=L%C3%B6rick
If there are non-ASCII characters in the hostname part of the IRI, e.g. http://例え.テスト/, they have to be encoded using Punycode instead.
Now you have a URI. It's an ugly URI. But most browsers will hide that for you: copy and paste it into the address bar or follow it in a link and you'll see it displayed with the original Unicode characters. Wikipedia has been using this for years, e.g.:
http://en.wikipedia.org/wiki/ɸ
The one browser whose behaviour is unpredictable and doesn't always display the pretty IRI version is...
...well, you know.
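Here is a sketch of that recipe in PHP, assuming the intl extension for the Punycode step and ignoring error handling, ports, userinfo, and fragments:
<?php
// IRI -> URI: Punycode the host, then percent-encode each non-ASCII
// byte of the UTF-8 path and query, leaving the ASCII structure alone.
function iri_to_uri(string $iri): string {
    $p = parse_url($iri);
    $host = idn_to_ascii($p['host'], IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    $enc = function (string $s): string {
        return preg_replace_callback('/[\x80-\xFF]/', function (array $m): string {
            return sprintf('%%%02X', ord($m[0]));
        }, $s);
    };
    $uri = $p['scheme'] . '://' . $host . $enc($p['path'] ?? '/');
    return isset($p['query']) ? $uri . '?' . $enc($p['query']) : $uri;
}
echo iri_to_uri('http://www.example.com/düsseldorf?neighbourhood=Lörick');
// http://www.example.com/d%C3%BCsseldorf?neighbourhood=L%C3%B6rick
?>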
Depending on your URL scheme, you can make the UTF-8 encoded part "not important". For example, if you look at Stack Overflow URLs, they're of the following form:
http://stackoverflow.com/questions/2742852/unicode-characters-in-urls
However, the server doesn't actually care if you get the part after the identifier wrong, so this also works:
http://stackoverflow.com/questions/2742852/これは、これを日本語のテキストです
So if you had a layout like this, then you could potentially use UTF-8 in the part after the identifier and it wouldn't really matter if it got garbled. Of course this probably only works in somewhat specialised circumstances...
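A sketch of that layout in PHP (the route pattern is hypothetical): dispatch on the numeric identifier alone, so the trailing UTF-8 slug is purely cosmetic and can be garbled in transit without breaking anything.
<?php
// Route on the id; ignore whatever slug follows it.
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
if (preg_match('#^/questions/(\d+)(?:/.*)?$#', $path, $m)) {
    $id = (int) $m[1]; // only this matters
    echo "Serving question $id";
}
?>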
I'm not sure if it is a good idea, but as mentioned in other comments, and as I interpret it, many Unicode characters are valid in HTML5 URLs.
E.g., the href docs ( http://www.w3.org/TR/html5/links.html#attr-hyperlink-href ) say:
The href attribute on a and area elements must have a value that is a valid URL potentially surrounded by spaces.
Then the definition of "valid URL" points to http://url.spec.whatwg.org/, which defines URL code points as:
ASCII alphanumeric, "!", "$", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", ":", ";", "=", "?", "#", "_", "~", and code points in the ranges U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFFD, U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000 to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD, U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E1000 to U+EFFFD, U+F0000 to U+FFFFD, U+100000 to U+10FFFD.
The term "URL code points" is then used in a few parts of the parsing algorithm, e.g. for the relative path state:
If c is not a URL code point and not "%", parse error.
Also, the validator http://validator.w3.org/ passes URLs like "你好", and does not pass URLs containing characters like spaces ("a b").
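If you want to check strings against that list yourself, here is a rough PHP approximation (note the assumption: the astral ranges are collapsed into one block, whereas the spec excludes the last two code points of every plane):
<?php
// Approximate test against the WHATWG "URL code points" quoted above.
function is_url_code_points(string $s): bool {
    return (bool) preg_match(
        '/^[A-Za-z0-9!$&\'()*+,\-.\/:;=?#_~\x{00A0}-\x{D7FF}\x{E000}-\x{FDCF}\x{FDF0}-\x{FFFD}\x{10000}-\x{10FFFD}]*$/u',
        $s
    );
}
var_dump(is_url_code_points('你好')); // true
var_dump(is_url_code_points('a b')); // false: space is not allowed
?>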
Related: Which characters make a URL invalid?
While all of these comments are true, you should note that since ICANN approved Arabic (Persian) and Chinese characters for registration in domain names, all of the browser makers (Microsoft, Mozilla, Apple, etc.) have to support Unicode in URLs without any encoding, and those URLs should be searchable by Google, etc.
So this issue should resolve itself soon.
For me this is the correct way; this just worked:
<?php
$linker = rawurldecode($link);
echo $linker;
?>
This worked, and now links are displayed properly:
http://newspaper.annahar.com/article/121638-معرض--جوزف-حرب-في-غاليري-جانين-ربيز-لوحاته-الجدية-تبحث-وتكتشف-وتفرض-الاحترام
Link found on:
http://www.galeriejaninerubeiz.com/newsite/news
Use the percent-encoded form. Some (mainly old) computers, running Windows XP for example, do not support Unicode, only ISO encodings. That is one reason percent-encoded URLs were invented. Also, if you give a user a URL printed on paper containing characters that cannot be easily typed, that user may have a hard time typing it (or just ignore it). The percent-encoded form can even be used on many of the oldest machines that ever existed (although of course they don't support the internet).
There is a downside though, as percent-encoded characters are longer than the original ones, thus possibly resulting in really long URLs. But just try to ignore it, or use a URL shortener (I would recommend goo.gl in this case, which makes a 13-character long URL). Also, if you don't want to register for a Google account, try bit.ly (bit.ly makes slightly longer URLs, with the length being 14 characters).