How can I display ë character in the url? - html

I have a URL that contains the ë character. Is there a way to display it as ë on the front end, while in the back end it is converted to the percent-encoded form %C3%AB of this character? When you view this particular question, the page URL displays the ë character. I want the same behaviour. Thanks in advance for any suggestion.

Well, you'd do well to look at the HTML for this page then:
<a href="/questions/38720183/how-can-i-display-%c3%ab-character-in-the-url">
You must use the correctly URL-encoded version, %c3%ab. The browser may then decide to render it as "ë". That's entirely up to the browser, and it won't do it for all characters; in particular, it won't decode certain lookalike characters that could be used to spoof a URL so that it looks identical to another URL while actually being different.
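For completeness, here is a minimal sketch of producing that encoded form yourself in JavaScript (assuming you build the link on the client; encodeURIComponent and decodeURIComponent are standard functions):
// "ë" is the UTF-8 byte pair C3 AB, so it percent-encodes to "%C3%AB":
const encoded = encodeURIComponent('ë');    // "%C3%AB"
// Use the encoded form inside the href; the browser may still render it as "ë" in the address bar:
const href = '/questions/38720183/how-can-i-display-' + encoded + '-character-in-the-url';
// Decoding gives the original character back:
const decoded = decodeURIComponent('%C3%AB');    // "ë"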

You should use percent-encoding, which is:
a mechanism for encoding information in a Uniform Resource Identifier (URI) under certain circumstances. Although it is known as URL encoding it is, in fact, used more generally within the main Uniform Resource Identifier (URI) set, which includes both Uniform Resource Locator (URL) and Uniform Resource Name (URN). As such, it is also used in the preparation of data of the application/x-www-form-urlencoded media type, as is often used in the submission of HTML form data in HTTP requests.
There is a website, http://www.url-encode-decode.com/, that will do it for you.

Related

Multilingual URLs showing as unicode in breadcrumb menu

I have a Norwegian URL path which looks like this /om-os/bæredygtighed/socialt-ansvar
In my breadcrumb menu, I expect to see something like this:
Om os > Bæredygtighed > Socialt-ansvar
However, the æ is appearing as %c3%a6. So my breadcrumb looks like this:
Om os > B%c3%a6redygtighed > Socialt-ansvar
I have <meta charset="utf-8"> in the head, so I'm unsure why these characters are still appearing?
I don't know how you are building the URLs, but, except for the domains, that have a different encoding, all non-ASCII parts of a URL must be URL-encoded, AKA percent-encoded. The browser does it for you if you don't do it yourself. OTOH, the browser will in most cases show you the unencoded version of your characters. You might not be aware that what is sent over the wire is URL-encoded.
E.g., your path is sent over the wire as /om-os/b%c3%a6redygtighed/socialt-ansvar, even if you see /om-os/bæredygtighed/socialt-ansvar in the address bar. Check it with the developer tools. If you use Firefox, you will have to look at the Headers tab of the HTTP call's details in the Network tab. Chrome, instead, will also show you the HTTP call's summary row URL-encoded. That %c3%a6 in the path is the hex value of the two bytes, C3 and A6, that make up the UTF-8 encoding of the character æ.
You can even set your window.location.pathname programmatically to /om-os/bæredygtighed/socialt-ansvar, but when you read window.location.pathname afterwards, you will get it URL-encoded:
window.location.pathname = '/om-os/bæredygtighed/socialt-ansvar'
[...]
console.log(window.location.pathname)
/om-os/b%C3%A6redygtighed/socialt-ansvar
I don't know how your path flows into your breadcrumbs, but you clearly can reverse the URL-encoding before using your strings.
In JavaScript you normally do that with decodeURIComponent():
console.log(decodeURIComponent('b%c3%a6redygtighed'))
bæredygtighed
console.log(decodeURIComponent('/om-os/b%c3%a6redygtighed/socialt-ansvar'))
/om-os/bæredygtighed/socialt-ansvar
In PHP you normally do that with urldecode:
$decoded = urldecode('b%c3%a6redygtighed'); // will contain 'bæredygtighed'
But it would be better if you could make your data flow in a way that avoids the encoding and decoding steps before reaching your breadcrumbs.
If you have not yet figured out the fix, here is something to add on top of what walter-tross has already mentioned in the answer above.
For the given input /om-os/bæredygtighed/socialt-ansvar,
the encodeURI JS method outputs
/om-os/b%C3%A6redygtighed/socialt-ansvar
and the encodeURIComponent JS method outputs
%2Fom-os%2Fb%C3%A6redygtighed%2Fsocialt-ansvar.
Given the above, it appears that you are fetching the breadcrumb input from the URL, and that the behaviour is equivalent to the encodeURI method (the slashes stay intact), which lets you split on the '/' character.
The fix, as already noted, is to URL-decode with decodeURI or decodeURIComponent on the individual components before using them as content.
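Here is a rough sketch of that fix, assuming the breadcrumb is built client-side from window.location.pathname (the capitalisation step is only for illustration):
// Decode each path segment before using it as breadcrumb text.
const segments = window.location.pathname
  .split('/')
  .filter(Boolean)               // drop the empty segment before the leading '/'
  .map(decodeURIComponent);      // "b%c3%a6redygtighed" -> "bæredygtighed"
// e.g. ["om-os", "bæredygtighed", "socialt-ansvar"]
const breadcrumb = segments
  .map(s => s.charAt(0).toUpperCase() + s.slice(1))
  .join(' > ');                  // "Om-os > Bæredygtighed > Socialt-ansvar"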

Valid domain name registration for Unicode characters

I'm trying to figure out what is valid for domain name registration; apparently some Unicode characters are translated oddly, while others are not translated at all.
This address:
http://xn--ippleman-dmj.com/
Translates to:
http://Nippleman.com/
and
http://xn--ggle-0nda.com/
should translate to:
http://gοοgle.com/
but for some reason the browser prevents it.
How is the format for these domains determined, and what is or isn't blocked by the browser?
http://xn--ippleman-dmj.com/ is a valid URL, while http://www.gοοgle.com is not. Yet Chrome only replaces the Unicode on the second URL.
It appears that you're trying to do an IDN homograph attack. The Wikipedia page nicely explains what Chrome is doing to stop you.
First, to your question.
A valid domain name must conform to RFC 1035 regardless of the browser, i.e. the whole domain name must not exceed 255 octets of valid ASCII characters, and it is case-insensitive. Even IDNs must comply with this standard, so to make IDNs displayable the RFCs evolved to include the Punycode 'xn--' conversion.
Then there are proofs of concept of the IDN homograph attack. Unicode.org currently updates and maintains a list of confusable characters. You can download the current version of TR39 and play around with it.
Previously, Chrome and Firefox would translate a domain name starting with xn-- to the corresponding Unicode, provided the glyphs were found in the browser's font cache; if the browser couldn't find a font for them, it would display the raw 'xn--' Punycode domain name.
This is a known issue. Firefox even has a manual option to enable/disable the Punycode domain name display. Google decided to remove the conversion from version 58 onwards, and Firefox 53 will follow by making the Punycode display the default.
I don't know whether Google will only show Unicode that is not in the TR39 confusables list, or simply remove the Punycode-to-Unicode conversion for everything.
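If you just want to see the Punycode form a browser will actually send, the WHATWG URL API (available in modern browsers and Node.js) applies the conversion for you; a small sketch:
// The URL parser converts Unicode labels in the host to their Punycode ("xn--") form:
console.log(new URL('http://例え.テスト/').hostname);    // prints the ASCII "xn--" form of the host
// A host that is already Punycode is left as-is; the API never converts it back to Unicode:
console.log(new URL('http://xn--ippleman-dmj.com/').hostname);    // "xn--ippleman-dmj.com"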

Multiple parameters in URL fragment

Is there a standard format that allows for multiple parameters to be specified in the URI fragment? (the part after the hash #, not the query string.)
The most related information would be this question: Multiple fragment identifiers correct in URL?. The allowed characters for fragments can be found in that question as well.
Would it be acceptable to use, for instance, a semicolon to delimit multiple parameters like this:
http://example.net/page.html?q=1#param1=foo;param2=bar
Are there any unintentional behaviours that I should be aware of with this method? What if there is no such ID in the document with the value param1?
For the purposes of this question, only URIs of HTML resources are considered.
I think you should read this: http://en.wikipedia.org/wiki/Fragment_identifier#Examples
So the de-facto standard format for multiple parameters should be #param1=value1&param2=value2
You can see this format used by Media Fragments URI 1.0 and by PDF documents. There seems to be no standard for HTML resources, though, as you can parse the fragment in JavaScript in any way you like. But I'd use the same format, as it looks more natural, being similar to the query string format. If the browser cannot find any element with an id/name equal to your hash fragment, it will navigate to the beginning of the document by default.
Also browsers will consider the complete hash fragment as a possible id/name. So they will look for id/name equal to param1=value1&param2=value2 but not just param1.
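If you do adopt the #param1=value1&param2=value2 format, a minimal way to read it back in JavaScript is sketched below (assuming the URLSearchParams API, which happens to parse this shape even though it was designed for query strings):
// location.hash includes the leading "#", so strip it before parsing:
const params = new URLSearchParams(location.hash.slice(1));
console.log(params.get('param1'));    // "value1"
console.log(params.get('param2'));    // "value2"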

What characters are allowed in the HTML Name attribute inside input tag?

I have a PHP script that will generate <input>s dynamically, so I was wondering if I needed to filter any characters in the name attribute.
I know that the name has to start with a letter, but I don't know any other rules. I figure square brackets must be allowed, since PHP uses these to create arrays from form data. How about parentheses? Spaces?
Note that not all characters are submitted in the name attributes of form fields (even when using POST)!
Leading and trailing white-space characters are trimmed, and inner white-space characters as well as the character . are replaced by _.
(Tested in Chrome 23, Firefox 13 and Internet Explorer 9, all on Win7.)
Any character you can include in an [X]HTML file is fine to put in an <input name>. As Allain's comment says, <input name> is defined as containing CDATA, so the only things you can't put in there are the control codes and invalid codepoints that the underlying standard (SGML or XML) disallows.
Allain quoted W3 from the HTML4 spec:
Note. The "get" method restricts form data set values to ASCII characters. Only the "post" method (with enctype="multipart/form-data") is specified to cover the entire ISO10646 character set.
However this isn't really true in practice.
The theory is that application/x-www-form-urlencoded data doesn't have a mechanism to specify an encoding for the form's names or values, so using non-ASCII characters in either is “not specified” as working and you should use POSTed multipart/form-data instead.
Unfortunately, in the real world, no browser specifies an encoding for fields even when it theoretically could, in the subpart headers of a multipart/form-data POST request body. (I believe Mozilla tried to implement it once, but backed out as it broke servers.)
And no browser implements the astonishingly complex and ugly RFC2231 standard that would be necessary to insert encoded non-ASCII field names into the multipart's subpart headers. In any case, the HTML spec that defines multipart/form-data doesn't directly say that RFC2231 should be used, and, again, it would break servers if you tried.
So the reality of the situation is that there is no way to know what encoding is being used for the names and values in a form submission, no matter what type of form it is. What browsers do with field names and values that contain non-ASCII characters is the same for GET and both types of POST form: they encode them using the encoding of the page containing the form. Non-ASCII GET form names are no more broken than everything else.
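As a rough way to see what a UTF-8 page puts on the wire, the sketch below uses the URLSearchParams API (which always serialises as UTF-8 percent-encoding, unlike a real form, which uses the page's own encoding as described above):
// Non-ASCII field names and values are both percent-encoded as UTF-8 bytes:
const pair = new URLSearchParams({ 'næme': 'værdi' });
console.log(pair.toString());    // "n%C3%A6me=v%C3%A6rdi"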
DLH:
So name has a different data type for <input> than it does for other elements?
Actually the only element whose name attribute is not CDATA is <meta>. See the HTML4 spec's attribute list for all the different uses of name; it's an overloaded attribute name, having many different meanings on the different elements. This is generally considered a bad thing.
However, typically these days you would avoid name except on form fields (where it's a control name) and param (where it's a plugin-specific parameter identifier). That's only two meanings to grapple with. The old-school use of name for identifying elements like <form> or <a> on the page should be avoided (use id instead).
The only real restriction on what characters can appear in form control names is when a form is submitted with GET
"The "get" method restricts form data set values to ASCII characters." reference
There's a good thread on it here.
While Allain's comment did answer the OP's direct question and bobince provided some brilliant in-depth information, I believe many people come here seeking the answer to a more specific question: "Can I use a dot character in a form input's name attribute?"
As this thread came up as the first result when I searched for this, I figured I might as well share what I found.
Firstly, Matthias claimed that:
character . are replaced by _
This is untrue. I don't know whether browsers actually did this kind of thing back in 2013, though I doubt it. Browsers send dot characters as they are (talking about POST data)! You can check it in the developer tools of any decent browser.
Please notice the tiny comment by abluejelly, which is probably missed by many:
I'd like to note that this is a server-specific thing, not a browser thing. Tested on Win7 FF3/3.5/31, IE5/7/8/9/10/Edge, Chrome39, and Safari Windows 5, and all of them sent " test this.stuff" (four leading spaces) as the name in POST to the ASP.NET dev server bundled with VS2012.
I checked it with the Apache HTTP Server (v2.4.25), and indeed an input name like "foo.bar" is changed to "foo_bar". But in a name like "foo[foo.bar]" the dot is not replaced by _!
My conclusion: you can use dots, but I wouldn't, as this may lead to unexpected behaviour depending on the HTTP server used.
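A quick way to confirm the browser side of this is sketched below using the FormData API (what the server or framework then does with the dot is a separate matter, as noted above):
// The browser-level representation keeps the dot in the field name:
const fd = new FormData();
fd.append('foo.bar', 'baz');
for (const [name, value] of fd.entries()) {
  console.log(name, value);    // "foo.bar baz" - the dot is untouched
}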
Do you mean the id and name attributes of the HTML input tag?
If so, I'd be very tempted to restrict (or convert) allowed "input" name characters into only a-z (A-Z), 0-9 and a limited range of punctuation (".", ",", etc.), if only to limit the potential for XSS exploits, etc.
Additionally, why let the user control any aspect of the input tag? (Might it not ultimately be easier, from a validation perspective, to keep the input tag names as 'custom_1', 'custom_2', etc. and then map these as required?)
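If you do decide to whitelist characters, a minimal sketch of that kind of sanitisation could look like this (the exact whitelist and the sanitizeName helper are just assumptions for illustration):
// Keep letters, digits, underscore, hyphen and square brackets (for PHP-style arrays);
// replace anything else, and make sure the name starts with a letter.
function sanitizeName(raw) {
  const cleaned = String(raw).replace(/[^a-zA-Z0-9_\[\]-]/g, '_');
  return /^[a-zA-Z]/.test(cleaned) ? cleaned : 'f_' + cleaned;
}
console.log(sanitizeName('user name (1).email'));    // "user_name__1__email"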

Unicode characters in URLs

In 2010, would you serve URLs containing UTF-8 characters in a large web portal?
Unicode characters are forbidden as per the RFC on URLs (see here). They would have to be percent encoded to be standards compliant.
My main point, though, is serving the unencoded characters for the sole purpose of having nice-looking URLs, so percent encoding is out.
All major browsers seem to be parsing those URLs okay no matter what the RFC says. My general impression, though, is that it gets very shaky when leaving the domain of web browsers:
URLs getting copy+pasted into text files, E-Mails, even Web sites with a different encoding
HTTP Client libraries
Exotic browsers, RSS readers
Is my impression correct that trouble is to be expected here, and thus it's not a practical solution (yet) if you're serving a non-technical audience and it's important that all your links work properly even if quoted and passed on?
Is there some magic way of serving nice-looking URLs in HTML
http://www.example.com/düsseldorf?neighbourhood=Lörick
that can be copy+pasted with the special characters intact, but work correctly when re-used in older clients?
Use percent-encoding. Modern browsers will take care of display and paste issues and make it human-readable, e.g. http://ko.wikipedia.org/wiki/위키백과:대문
Edit: when you copy such a URL in Firefox, the clipboard will hold the percent-encoded form (which is usually a good thing), but if you copy only a part of it, it will remain unencoded.
What Tgr said. Background:
http://www.example.com/düsseldorf?neighbourhood=Lörick
That's not a URI. But it is an IRI.
You can't include an IRI in an HTML4 document; the type of attributes like href is defined as URI and not IRI. Some browsers will handle an IRI here anyway, but it's not really a good idea.
To encode an IRI into a URI, take the path and query parts, UTF-8-encode them then percent-encode the non-ASCII bytes:
http://www.example.com/d%C3%BCsseldorf?neighbourhood=L%C3%B6rick
If there are non-ASCII characters in the hostname part of the IRI, e.g. http://例え.テスト/, they have to be encoded using Punycode instead.
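A quick way to see both steps at once is the WHATWG URL API (a sketch; it percent-encodes the path/query and Punycode-encodes the host in one go):
// The URL parser turns the IRI into its URI form:
console.log(new URL('http://www.example.com/düsseldorf?neighbourhood=Lörick').href);
// "http://www.example.com/d%C3%BCsseldorf?neighbourhood=L%C3%B6rick"
console.log(new URL('http://例え.テスト/').href);
// the host comes back in its Punycode ("xn--") form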
Now you have a URI. It's an ugly URI. But most browsers will hide that for you: copy and paste it into the address bar or follow it in a link and you'll see it displayed with the original Unicode characters. Wikipedia have been using this for years, eg.:
http://en.wikipedia.org/wiki/ɸ
The one browser whose behaviour is unpredictable and doesn't always display the pretty IRI version is...
...well, you know.
Depending on your URL scheme, you can make the UTF-8 encoded part "not important". For example, if you look at Stack Overflow URLs, they're of the following form:
http://stackoverflow.com/questions/2742852/unicode-characters-in-urls
However, the server doesn't actually care if you get the part after the identifier wrong, so this also works:
http://stackoverflow.com/questions/2742852/これは、これを日本語のテキストです
So if you had a layout like this, then you could potentially use UTF-8 in the part after the identifier and it wouldn't really matter if it got garbled. Of course this probably only works in somewhat specialised circumstances...
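A sketch of how that kind of routing can work (purely illustrative; the path pattern and the questionIdFromPath helper are assumptions, not how Stack Overflow actually implements it):
// Only the numeric id matters; whatever follows it is ignored,
// so a re-encoded or garbled slug still resolves to the same question.
function questionIdFromPath(path) {
  const match = /^\/questions\/(\d+)(?:\/.*)?$/.exec(decodeURIComponent(path));
  return match ? Number(match[1]) : null;
}
console.log(questionIdFromPath('/questions/2742852/unicode-characters-in-urls'));    // 2742852
console.log(questionIdFromPath('/questions/2742852/%E3%81%93%E3%82%8C'));            // 2742852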
Not sure if it is a good idea, but as mentioned in other comments and as I interpret it, many Unicode chars are valid in HTML5 URLs.
E.g., the href docs (http://www.w3.org/TR/html5/links.html#attr-hyperlink-href) say:
The href attribute on a and area elements must have a value that is a valid URL potentially surrounded by spaces.
Then the definition of "valid URL" points to http://url.spec.whatwg.org/, which defines URL code points as:
ASCII alphanumeric, "!", "$", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", ":", ";", "=", "?", "#", "_", "~", and code points in the ranges U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFFD, U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000 to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD, U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E1000 to U+EFFFD, U+F0000 to U+FFFFD, U+100000 to U+10FFFD.
The term "URL code points" is then used in a few parts of the parsing algorithm, e.g. for the relative path state:
If c is not a URL code point and not "%", parse error.
Also, the validator http://validator.w3.org/ passes URLs like "你好", and does not pass URLs with characters like spaces, e.g. "a b".
Related: Which characters make a URL invalid?
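Based on the ranges quoted above, a check of whether a single character is a URL code point could be sketched like this (derived directly from that list; the isUrlCodePoint helper is just for illustration, not production validation):
// Allowed ASCII characters from the quoted list, plus the non-ASCII ranges.
const URL_PUNCTUATION = "!$&'()*+,-./:;=?#_~";
const NON_ASCII_RANGES = [
  [0x00A0, 0xD7FF], [0xE000, 0xFDCF], [0xFDF0, 0xFFFD],
  [0x10000, 0x1FFFD], [0x20000, 0x2FFFD], [0x30000, 0x3FFFD],
  [0x40000, 0x4FFFD], [0x50000, 0x5FFFD], [0x60000, 0x6FFFD],
  [0x70000, 0x7FFFD], [0x80000, 0x8FFFD], [0x90000, 0x9FFFD],
  [0xA0000, 0xAFFFD], [0xB0000, 0xBFFFD], [0xC0000, 0xCFFFD],
  [0xD0000, 0xDFFFD], [0xE1000, 0xEFFFD], [0xF0000, 0xFFFFD],
  [0x100000, 0x10FFFD],
];
function isUrlCodePoint(ch) {
  if (/^[a-zA-Z0-9]$/.test(ch) || URL_PUNCTUATION.includes(ch)) return true;
  const cp = ch.codePointAt(0);
  return NON_ASCII_RANGES.some(([lo, hi]) => cp >= lo && cp <= hi);
}
console.log(isUrlCodePoint('好'));    // true
console.log(isUrlCodePoint(' '));     // false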
While all of these comments are true, you should note that since ICANN approved Arabic (Persian) and Chinese characters for registration as domain names, all of the browser makers (Microsoft, Mozilla, Apple, etc.) have to support Unicode in URLs without any encoding, and those URLs should be searchable by Google, etc.
So this issue should be resolved soon.
For me this is the correct way; this just worked:
<?php
$linker = rawurldecode($link);
echo $linker;
?>
This worked, and now links are displayed properly:
http://newspaper.annahar.com/article/121638-معرض--جوزف-حرب-في-غاليري-جانين-ربيز-لوحاته-الجدية-تبحث-وتكتشف-وتفرض-الاحترام
Link found on:
http://www.galeriejaninerubeiz.com/newsite/news
Use the percent-encoded form. Some (mainly older) computers, for example those running Windows XP, do not support Unicode, but rather ISO encodings; that is the reason percent-encoded URLs were invented. Also, if you give a user a URL printed on paper that contains characters that cannot be easily typed, that user may have a hard time typing it (or may just ignore it). The percent-encoded form can be used even on many of the oldest machines that ever existed (although they don't support the internet, of course).
There is a downside, though: percent-encoded characters are longer than the original ones, possibly resulting in really long URLs. But just try to ignore that, or use a URL shortener (I would recommend goo.gl in this case, which makes a 13-character-long URL). Also, if you don't want to register for a Google account, try bit.ly (bit.ly makes slightly longer URLs, 14 characters long).