Forward slashes in http protocol declaration in URL - html

I have just noticed that on HTML form validation for an input type of url that the double forward slash '//' after the protocol: is not required. I tried entering URLs in to many browsers without the forward slashes and they all work e.g. http:www.web-dewd.com works in Chrome, Firefox, Edge, Opera, and dare I say it, even IE11.
I cannot find any specific definition which states whether they are required or not. I spent a good few minutes on https://www.w3.org/standards/ without any luck. The best I could find was an interview with Tim Berners-Lee stating they are not required: http://www.dailymail.co.uk/sciencetech/article-1220286/Sir-Tim-Berners-Lee-admits-forward-slashes-web-address-mistake.html :
But with the colon in there as well, it turns out people never use the slash slash...
This article from ZDNet states:
there is practically no reference to the double forward-slash on the web
I would argue that slashes are recommended, but does anyone know and is able to provide evidence of what the correct standard is?
Somewhat ironically, Stackoverflow does require // when entering a link, as do other editors when determining to convert text to a link e.g. Microsoft Outlook.

Source
PrePrefix: To be a Uniform Resource Locator as currently defined by the URI
working group, the whole string must start with a constant prefix
"URL:"
this part says that valid URL starts with protocol: and does not states anything about //
Internet protocol parts Those schemes which refer to internet protocols mostly have a
common syntax for the rest of the object name. This starts with a
double slash "//" to indicate its presence, and continues until the
following slash "/".
To indicate URL string must start with protocol: and // is just common syntax to indicate domain name start.
When replacing URL usually you look for http[s]:// instead of http[s]:. It's just common practice, and does not mean that all web developers will use that.

Related

What's the difference between the two css link below, and what are the purposes of the two links? [duplicate]

I just learned from a colleague that omitting the "http | https" part of a URL in a link will make that URL use whatever scheme the page it's on uses.
So for example, if my page is accessed at http://www.example.com and I have a link (notice the '//' at the front):
Google
That link will go to http://www.google.com.
But if I access the page at https://www.example.com with the same link, it will go to https://www.google.com
I wanted to look online for more information about this, but I'm having trouble thinking of a good search phrase. If I search for "URLs without HTTP" the pages returned are about urls with this form: "www.example.com", which is not what I'm looking for.
Would you call that a schemeless URL? A protocol-less URL?
Does this work in all browsers? I tested it in FF and IE 8 and it worked in both. Is this part of a standard, or should I test more browsers?
Protocol relative URL
You may receive unusual security warnings in some browsers.
See also, Wikipedia Protocol-relative URLs for a brief definition.
At one time, it was recommended; but going forward, it should be avoided.
See also the Stack Overflow question Why use protocol-relative URLs at all?.
It is called network-path reference (the part that is missing is called scheme or protocol) defined in RFC3986 Section 4.2
4.2 Relative Reference
A relative reference takes advantage of the hierarchical syntax
(Section 1.2.3) to express a URI reference relative to the name space
of another hierarchical URI.
relative-ref = relative-part [ "?" query ] [ "#" fragment ]
relative-part = "//" authority path-abempty
/ path-absolute
/ path-noscheme
/ path-empty
The URI referred to by a relative reference, also known as the target URI, is obtained by applying the reference resolution
algorithm of Section 5.
A relative reference that begins with two slash characters is
termed a network-path reference (emphasis mine); such references are rarely used.
A relative reference that begins with a single slash character is termed an absolute-path reference. A relative reference that does not begin with a slash character is termed a relative-path reference.
A path segment that contains a colon character (e.g., "this:that") cannot be used as the first segment of a relative-path reference, as it would be mistaken for a scheme name. Such a segment must be preceded by a dot-segment (e.g., "./this:that") to make a relative- path reference.

What is the difference between these URL syntax?

I was sent a hyperlink to a Tableau Public link by a client. When I tried opening it, I got a 404 exception. I wrote back to the client but was told by the same that the link was working fine. I visited his profile page and was able to open the presentation there, but the URL that ended up working was slightly different than the one behind the original, non-functioning link.
Here's the anonymized URL behind the original link
https://public.tableau.com/profile/[client_name]%23!/vizhome/Project-AirportDelay/FlightPerformancesinUSA?publish=yes
And here's the URL via the profile page:
https://public.tableau.com/profile/[client_name]#!/vizhome/Project-AirportDelay/FlightPerformancesinUSA
The only differences I see are ?publish=yes and %23!. I tried appending the former, ?publish=yes, to the working URL, and it was still functional. So I suspect that it has to do with the other difference %23! vs. #!. Could the first work because he is opening it from his computer where he is likely logged onto Tableau Public? What's the difference between these syntax? Any ideas about why the original hyperlink might not be functional?
For obvious privacy reasons, I can't provide the whole URL.
It looks like the basic URL pattern for passing filters ?publish=yes
and
%23 is the URL encoded representation of #
The first # after the authority component starts the fragment component. If the # should be part of the path component or the query component, it has to be percent-encoded as %23.
As # is a reserved character, these URIs aren’t equivalent:
http://example.com/foo#bar
http://example.com/foo%23bar
There are countless ways how a URI reference could become erroneous. The culprit is often a software, like a word processor, where someone pastes the correct URI, and the software incorrectly percent-encodes it (maybe assuming that the user didn’t paste the real/correct URI).
Copy-pasting the URI from the browser address bar into a plain text document should always work correctly.

Valid domain name registration for Unicode characters

I'm trying to figure out what is valid for domain name registration, apparently some Unicode characters are translated weirdly while others do not at all.
This address:
http://xn--ippleman-dmj.com/
Translates to:
http://Nippleman.com/
and
http://xn--ggle-0nda.com/
should translate to:
http://gοοgle.com/
but for some reason the browser prevents it.
How is the format for these domains determined, and what is or isn't blocked by the browser?
http://xn--ippleman-dmj.com/ is a valid URL, while http://www.gοοgle.com is not. Yet Chrome only replaces the Unicode on the second URL.
It appears that you're trying to do an IDN homograph attack. The Wikipedia page nicely explains what Chrome is doing to stop you.
First, to your question.
The valid domain name must conform to RFC1035 regardless of browser, i.e. the whole domain name must not exceed 255 valid ASCII character (in octet) and it is case insensitive. Even IDN must comply with this standard. So to display IDN, RFC evolve come out with the Punycode 'xn--' conversion idea.
Then there is proof of concept of IDN homograph attack. Currently, Unicode.org update and maintain a confusable list. You can download current version TR39 and play around with it.
Previously, Chrome and firefox will translate domain name start with xn-- to correspondence Unicode found inside the browser font cache. If the browser can't find the font, it will display the raw 'xn--' punycode domain name.
This is known issues. Firefox even has manual option to enable/disable the Punycode domain name display. Google decide to remove the conversion post version 58+, while Firefox 53 will follow to make display Punycode as default.
I don't know whether Google will show Unicode(s) not inside TR39 or just remove the Punycode to Unicode conversion for all.

How do browsers interpret hrefs that start with "http:/"?

Some of my users are creating links that look like
<a href='http:/some_local_path'>whatever</a>
I've noticed that Firefox interprets this as
<a href='/some_local_path'>whatever</a>
Can I count on this occurring in all browsers? Or should I remove the http:/s myself?
This is an unusual URL, but is not invalid. The URL spec says that omitted components are defaulted from the base url, which can be provided explicitly in a <base> tag, or absent that, the current URL of the page.
When a browser sees /some_local_path, it is missing a scheme and a host, so it takes them from the base url. When your users put http:/some_local_path, it has an explicit scheme, but is missing a host, so the host defaults to the base url. If your page is an http: page, then the two URLs will be interpreted identically.
All that said, those URLs are almost certainly not what your users intended. You'll be helping them if you point out their error.
It's always best to validate data entered by users. Inevitably, you'll get something unexpected.

Unicode characters in URLs

In 2010, would you serve URLs containing UTF-8 characters in a large web portal?
Unicode characters are forbidden as per the RFC on URLs (see here). They would have to be percent encoded to be standards compliant.
My main point, though, is serving the unencoded characters for the sole purpose of having nice-looking URLs, so percent encoding is out.
All major browsers seem to be parsing those URLs okay no matter what the RFC says. My general impression, though, is that it gets very shaky when leaving the domain of web browsers:
URLs getting copy+pasted into text files, E-Mails, even Web sites with a different encoding
HTTP Client libraries
Exotic browsers, RSS readers
Is my impression correct that trouble is to be expected here, and thus it's not a practical solution (yet) if you're serving a non-technical audience and it's important that all your links work properly even if quoted and passed on?
Is there some magic way of serving nice-looking URLs in HTML
http://www.example.com/düsseldorf?neighbourhood=Lörick
that can be copy+pasted with the special characters intact, but work correctly when re-used in older clients?
Use percent encoding. Modern browsers will take care of display & paste issues and make it human-readable. E. g. http://ko.wikipedia.org/wiki/위키백과:대문
Edit: when you copy such an url in Firefox, the clipboard will hold the percent-encoded form (which is usually a good thing), but if you copy only a part of it, it will remain unencoded.
What Tgr said. Background:
http://www.example.com/düsseldorf?neighbourhood=Lörick
That's not a URI. But it is an IRI.
You can't include an IRI in an HTML4 document; the type of attributes like href is defined as URI and not IRI. Some browsers will handle an IRI here anyway, but it's not really a good idea.
To encode an IRI into a URI, take the path and query parts, UTF-8-encode them then percent-encode the non-ASCII bytes:
http://www.example.com/d%C3%BCsseldorf?neighbourhood=L%C3%B6rick
If there are non-ASCII characters in the hostname part of the IRI, eg. http://例え.テスト/, they have be encoded using Punycode instead.
Now you have a URI. It's an ugly URI. But most browsers will hide that for you: copy and paste it into the address bar or follow it in a link and you'll see it displayed with the original Unicode characters. Wikipedia have been using this for years, eg.:
http://en.wikipedia.org/wiki/ɸ
The one browser whose behaviour is unpredictable and doesn't always display the pretty IRI version is...
...well, you know.
Depending on your URL scheme, you can make the UTF-8 encoded part "not important". For example, if you look at Stack Overflow URLs, they're of the following form:
http://stackoverflow.com/questions/2742852/unicode-characters-in-urls
However, the server doesn't actually care if you get the part after the identifier wrong, so this also works:
http://stackoverflow.com/questions/2742852/これは、これを日本語のテキストです
So if you had a layout like this, then you could potentially use UTF-8 in the part after the identifier and it wouldn't really matter if it got garbled. Of course this probably only works in somewhat specialised circumstances...
Not sure if it is a good idea, but as mentioned in other comments and as I interpret it, many Unicode chars are valid in HTML5 URLs.
E.g., href docs say http://www.w3.org/TR/html5/links.html#attr-hyperlink-href:
The href attribute on a and area elements must have a value that is a valid URL potentially surrounded by spaces.
Then the definition of "valid URL" points to http://url.spec.whatwg.org/, which defines URL code points as:
ASCII alphanumeric, "!", "$", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", ":", ";", "=", "?", "#", "_", "~", and code points in the ranges U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFFD, U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000 to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD, U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E1000 to U+EFFFD, U+F0000 to U+FFFFD, U+100000 to U+10FFFD.
The term "URL code points" is then used in a few parts of the parsing algorithm, e.g. for the relative path state:
If c is not a URL code point and not "%", parse error.
Also the validator http://validator.w3.org/ passes for URLs like "你好", and does not pass for URLs with characters like spaces "a b"
Related: Which characters make a URL invalid?
As all of these comments are true, you should note that as far as ICANN approved Arabic (Persian) and Chinese characters to be registered as Domain Name, all of the browser-making companies (Microsoft, Mozilla, Apple, etc.) have to support Unicode in URLs without any encoding, and those should be searchable by Google, etc.
So this issue will resolve ASAP.
For me this is the correct way, This just worked:
$linker = rawurldecode("$link");
<?php echo $linker ;?>
This worked, and now links are displayed properly:
http://newspaper.annahar.com/article/121638-معرض--جوزف-حرب-في-غاليري-جانين-ربيز-لوحاته-الجدية-تبحث-وتكتشف-وتفرض-الاحترام
Link found on:
http://www.galeriejaninerubeiz.com/newsite/news
Use percent-encoded form. Some (mainly old) computers running Windows XP for example do not support Unicode, but rather ISO encodings. That is the reason percent-encoded URLs were invented. Also, if you give a URL printed on paper to a user, containing characters that cannot be easily typed, that user may have a hard time typing it (or just ignore it). Percent-encoded form can even be used in many of the oldest machines that ever existed (although they don't support internet of course).
There is a downside though, as percent-encoded characters are longer than the original ones, thus possibly resulting in really long URLs. But just try to ignore it, or use a URL shortener (I would recommend goo.gl in this case, which makes a 13-character long URL). Also, if you don't want to register for a Google account, try bit.ly (bit.ly makes slightly longer URLs, with the length being 14 characters).