I can't think of an example, but hopefully you get the idea. Encoded URLs have some characters replaced with those %20-style escape codes (so none of the original characters or meaning is lost), whereas slugs have all the special characters stripped out and whitespace usually replaced with dashes or plus signs ('-' or '+').
The encoding that is used for URLs is not as weird as it may look. It is simply needed to represent characters in URLs that would otherwise not be allowed or would be inconvenient. A search engine crawler is able to decode them and get the original meaning back. If you have something like foreign-language letters in words that would otherwise be garbled, it will very likely make a difference for the search engine. So if you expect to have such words in URLs, and if they may be important keywords for your site, I would suggest using proper URL encoding rather than stripping the special characters. For simple non-letter characters, though, such as the mentioned %20 (a space character), you may continue to use +, - or . as a replacement if you prefer that.
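For example, here is a rough sketch using JavaScript's encodeURIComponent/decodeURIComponent (standard percent-encoding; the word "café" is just an illustration):

encodeURIComponent('café')
// returns "caf%C3%A9" (the UTF-8 bytes of "é" are preserved)
decodeURIComponent('caf%C3%A9')
// returns "café", so nothing is lost
// A slug that simply strips the accented letter would give "caf" (or "cafe"), losing the original spelling.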
As mentioned earlier, you would have to guess because Google does not publish that information, but one important point is the user's bookmarks/history: would you prefer reading
yoursite.com/your-folder/your article
or
yoursite.com/your%20folder/your%20article
this is a simple article name and the first version looks easier to read; imagine it with a name like this question's URL:
yoursite.com/do-encoded-urls-have-better-seo-than-slugs
or
yoursite.com/Do%20encoded%20URLs%20have%20better%20SEO%20than%20slugs%3F
I know the percent symbol has to be URL-encoded when being passed around, but when I display it in the browser, is it also necessary to escape it, like so: &#37;?
In URLs, the percent sign (%) has a special meaning (it introduces an escape sequence), so it must itself be escaped as %25. In HTML it does not, so escaping it is not necessary.
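As a small illustration (the link target here is made up):

<!-- In a URL, a literal percent sign must be escaped as %25 -->
<a href="/deals/50%25-off">50% off</a>
<!-- In the visible HTML text above, the bare % needs no escaping;
     &#37; would also work, but it is not required. -->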
I agree with the chosen answer, but would like to qualify the statement “it is not necessary to escape it.”
If you have a need (or desire) to escape a percent sign in HTML code (and there are good reasons to do this with any potentially ambiguous character or symbol), then I would highly recommend using the named entity &percnt; as opposed to any numeric code (I only use numeric codes when there is no entity name available).
That was the answer I was looking for when I found this page, because I had forgotten that it loses the final "e" (it is &percnt;, not &percent;).
We should probably all be using at least the entities kindly listed here. (whoever Webmasterish is; thank you)
Reasoning: Numeric codes (and particularly byte codes from unencoded characters) change with code–pages, on systems using different default languages, and / or different operating systems. (Windows and Mac using slightly different code sets for “English” being the classic, which still plagues plain–text eMail sent between Apple Mail and Outlook) This is slowing down, and should stop with UTF, but I'm still seeing it pop up.
If you're converting HTML to some other mark–up, (note, I used "–" not a "-", or even "−" for the same reason) such as LaTeX, DVI, PostScript or even MarkDown, then it's useful to completely squash any ambiguity… And those processes tend to happen on the information you least expect to be used in such a way when you initially write it. So just get used to doing it everywhere and be grateful to your former self for having had the foresight to do so. Probably years down the line, when you're looking to update formulae to be more readable by utilising MathJax or such, and keep picking up hyphenated words. <swearmarks>
I'd like to add this: if you use JavaScript in href, you are in trouble too. Check this example:
http://jsfiddle.net/cs4MZ/
One of the workarounds might be using onclick instead of href.
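Roughly what the example demonstrates (an illustrative sketch, not the actual fiddle code): a javascript: URL is still a URL, so browsers may percent-decode it before running the script, whereas an onclick handler is plain script and is left alone.

<!-- The %20 here may be decoded to a space before the script runs,
     so this can alert "100 pure" instead of "100%20pure" -->
<a href="javascript:alert('100%20pure')">href version</a>

<!-- Workaround: keep the script out of the URL -->
<a href="#" onclick="alert('100%20pure'); return false;">onclick version</a>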
If you're talking about HTML text visible to the reader, no; it can't do anything harmful there.
...but if you're talking about text inside HTML attributes, then yes, escaping is good to consider.
URLs and HTML are different languages, as weird as that might seem, so they have different weaknesses.
I'm making a static HTML page that displays courtesy text in multiple languages. I noticed that if I paste ウェブサイトのメンテナンスの下で into Expression Blend, that text appears the same in the code. I think it's bad for compatibility and should be replaced by proper HTML entities.
I have tried http://www.opinionatedgeek.com/DotNet/Tools/HTMLEncode/encode.aspx but it returns me the same Japanese text.
Is it correct, from the point of view of browser compatibility, to paste that Japanese right into the source code of an HTML page?
Else, what is the correct HTML encoding of that text? Or, better, is there any tool that I can use to convert non-ASCII characters to HTML entities, possibly online and possibly free?
I think it's bad for compatibility and should be replaced by proper HTML entities.
Quite the opposite, actually: your preference should be not to use HTML entities, but rather to declare the document encoding correctly as UTF-8 and use the actual characters. There are quite a few compelling reasons to do so, but the real question is why not, since it's a well- and widely-supported standard?
Some of those points have been summarised previously:
UTF-8 encodings are easier to read and edit for those who understand what the character means and know how to type it.
UTF-8 encodings are just as unintelligible as HTML entity encodings for those who don't understand them, but they have the advantage of rendering as special characters rather than hard-to-understand decimal or hex encodings.
[For example] Wikipedia... actually go through articles and convert character entities to their corresponding real characters for the sake of user-friendliness and searchability.
As long as you mark your web page as UTF-8, either in the HTTP headers or the meta tags, having foreign characters in your web pages should be a non-issue. Alternatively, you could encode/decode these strings using the encodeURI/decodeURI functions in JavaScript:
encodeURI('ウェブサイトのメンテナンスの下で')
// returns "%E3%82%A6%E3%82%A7%E3%83%96%E3%82%B5%E3%82%A4%E3%83%88%E3%81%AE%E3%83%A1%E3%83%B3%E3%83%86%E3%83%8A%E3%83%B3%E3%82%B9%E3%81%AE%E4%B8%8B%E3%81%A7"
decodeURI("%E3%82%A6%E3%82%A7%E3%83%96%E3%82%B5%E3%82%A4%E3%83%88%E3%81%AE%E3%83%A1%E3%83%B3%E3%83%86%E3%83%8A%E3%83%B3%E3%82%B9%E3%81%AE%E4%B8%8B%E3%81%A7")
// returns "ウェブサイトのメンテナンスの下で"
If you are looking for a tool to convert a bunch of static strings to or from their encoded form, you could simply use the encodeURI/decodeURI functions from a web page's developer console (e.g. Firebug for Mozilla Firefox). Hope this helps!
HTML entities are only useful if you need to represent a character that cannot be represented in the encoding your document is saved in. For example, ASCII has no way to represent "€". If you want to use that character in an ASCII-encoded HTML document, you have to encode it as &euro; (or the numeric reference &#8364;) or not use it at all.
If you are using a character encoding for your document that can represent all the characters you need though, like UTF-8, there's no need for HTML entities. You simply need to make sure the browser knows what encoding the document is in so it can interpret it correctly. This is really the preferable method, since it simply keeps the source code readable. It really makes no sense to want to work with HTML entities if you can simply work with the actual characters.
See http://kunststube.net/frontback for some more information.
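A minimal sketch of that approach, using the Japanese text from the question:

<!DOCTYPE html>
<html>
<head>
  <!-- tell the browser the bytes are UTF-8 -->
  <meta charset="utf-8">
  <title>Courtesy text</title>
</head>
<body>
  <!-- the characters can be written directly; no entities needed -->
  <p>ウェブサイトのメンテナンスの下で</p>
</body>
</html>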
When I copy/paste text from most sites and pdfs, the following characters are almost always in the unicode equivalent:
double quote: " is “ and ” (&ldquo; and &rdquo;)
single quote: ' is ‘ and ’ (&lsquo; and &rsquo;)
ellipsis: ... is … (&hellip;)
I understand ones that can't be represented without unicode like © and ¢, but even for those, I wonder.
When should you use these unicode equivalents? Are they more semantic than not using them? Are they better interpreted by devices (copy/paste/print)? I always find it annoying getting those quote and ellipsis characters because with textmate + programming, you don't use them.
When should you use these unicode equivalents? Are they more semantic than not using them?
Note that these are not “unicode equivalents”. Those characters are available in many character sets other than Unicode, and they are strictly distinct from the alternatives that you propose.
In typography, the left and right versions of the single and double quotation marks are correct. They provide the traditional appearance for those characters that has been used in print media for many years. The ellipsis character provides the correct spacing for an ellipsis that does not naturally occur when using consecutive full stop characters. So the reason all of these are used is to make the text appear correctly to human readers.
Are they better interpreted by devices (copy/paste/print)?
Any system that uses any character set should be designed to correctly handle that character set. If the text is encoded in Unicode, then any recent system (from the last 15 years at least) should be able to handle it, since Unicode is the de facto standard character set for all modern systems.
Not all Unicode-conformant systems will be able to display all characters correctly. This will depend on the fonts available, and even the rendering system that uses the fonts. But any Unicode-conformant system will be able to transmit the characters unaltered (such as in a copy and paste operation).
I always find it annoying getting those quote and ellipsis characters because with textmate + programming, you don't use them.
It is unusual to copy English (or whatever language) text directly into a program without having to add separate delimiters to that text. But most modern programming languages will not have any difficulty handling the text once it is properly delimited.
Any systems that cannot handle Unicode correctly should be updated. Legacy character encodings will have no place in the future.
I think there's a simple explanation: MS Word converts these characters/sequences automatically as you type, and a lot of text on the internet has been copied from that editor.
Most of the articles I get for my site from other authors are sent as .doc files and I have to convert them. Usually, they contain the characters you've mentioned.
I'd also add one more: many different types of dashes instead of the hyphen, and also the low opening double quote (as seen in some European languages).
I usually let them stay in the text (all my pages are Unicode). It's just important to remember them when playing around with regexes etc. (especially the dashes, which can be tricky and hard to spot).
HTML entities serve a triple purpose (a short example follows this list):
Being able to use characters that do not belong to the document character set, e.g. inserting a euro symbol in an ISO-8859-1 document.
Escape characters that have a special meaning in HTML, such as angle brackets.
Make it easier to type characters that are not in your keyboard or are not supported by your editor, e.g. a copyright symbol.
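A small sketch illustrating all three purposes at once (assuming an ISO-8859-1 page, which is exactly the case where the euro entity is needed):

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<p>Price: 100 &euro;</p>             <!-- 1. character outside the document charset -->
<p>Write &lt;em&gt; for emphasis</p> <!-- 2. escaping characters with special HTML meaning -->
<p>&copy; Example Corp</p>           <!-- 3. easier to type than © on many keyboards -->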
Update:
My info is correct but I suspect I've answered the wrong question...
On the web, I would consider that markup adds semantic meaning, content does not. So it doesn't really matter which you use in this context.
Typographers would insist on “ and ”, where programmers don't care and just use regular old quotes ".
The key here is interoperability. There are different encoding schemes. As we've all been victim to, people paste content into an editor from Word, which uses Windows-1252 encoding. When you serve this content up via AJAX, it usually breaks, because AJAX uses UTF-8 encoding by default.
Office 2010 now allows for the saving of documents in UTF-8 format. Also, databases have different Unicode encoding schemes. The best bet is to use UTF-8 end-to-end.
When you copy/paste text that includes special characters, they will be left as they are. This is perfectly fine if the characters match the charset used by the web page.
HTML entities are just a convenience for producing specific characters in any character set. Keyboards tend not to have keys to get symbols like ©, so the HTML entity is a shortcut.
I'm going to generalize and say that most of the time the content is UTF-8 (please correct me if I'm wrong). The copied characters are usually copied correctly and everything works great. If they aren't copied correctly, or the charset is subject to change, or you're after i18n support, go with the HTML or XML entities. Otherwise, leave them as they are; the browser will display them just fine.
Within my HTML, can I use the character entity reference "&nbsp;" in place of "%20" in web URLs?
They're both spaces, right?
The short answer is, they are both used to represent "spaces", but they represent different spaces.
%20 is the URL escaping for byte 32, which corresponds to plain old space in pretty much any encoding you're likely to use in a URL.
&nbsp; is an HTML character reference which actually refers to character 160 of Unicode (and also of ISO-8859-1, aka Latin-1). It's a different space character entirely: the "non-breaking space". Even though they look pretty much the same, they're different characters and it's unlikely that your server will treat them the same way.
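A quick way to confirm they are different characters (a small JavaScript sketch):

decodeURIComponent('%20') === ' '
// true: %20 is the ordinary space, U+0020
'\u00A0' === ' '
// false: the non-breaking space (U+00A0, what &nbsp; produces) is a different character
encodeURIComponent('\u00A0')
// returns "%C2%A0", not "%20"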
No. Neither are spaces (technically). Both represent spaces in different ways though. Make every effort to NOT have spaces, or representatives of spaces, in your URLs. Many find it more elegant (me included) to replace spaces with _ or -
No. &nbsp; is an HTML non-breaking-space entity; this entity has no meaning when used in a filesystem or anywhere else that a URL might point. URLs are not encoded in HTML.
No, not in the URLs. What you can do is replace spaces in the textual representation of the URL.
So instead of:
http://some.site/doc%20with%20spaces
you can have:
http://some.site/doc with spaces
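In the markup itself that looks something like this (same illustrative URL as above):

<a href="http://some.site/doc%20with%20spaces">http://some.site/doc with spaces</a>

The href stays properly percent-encoded; only the visible link text shows plain spaces.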
%20 is what you get with URL encoding, so this is what you should use if you are going to use it in a URL.
&nbsp; is an HTML entity, which is what should be used for a non-breaking space in an HTML document.
Most people try to avoid spaces in their file names and URLs entirely; they will give you a serious headache sooner or later, so try to do the same.
If you do want to have spaces in a URL, you have to encode them as %20.
&nbsp; is used by the browser to know how to display the page; this information is only used for displaying. The %20 will be sent to the server, which manages all the stuff needed to transfer the web page to your visitors. The server doesn't speak HTML, so it would interpret &nbsp; as a normal part of the file name and search for a file called something like foo&nbsp;bar. That file will not be found. Worse, the web server may think that the & begins the variable part of the URL, search only for the page foo, and then try to read a variable nbsp and a variable bar without seeing any values for them. All in all, the web server can't handle a URL with an &nbsp; in it.
Neither are spaces. You shouldn't be using spaces, but if for whatever reason you can't avoid them, you should just be able to write...
<a href="WebSite/Web Page.aspx">Hey there</a>
...clicking on which will automatically navigate the user to
WebSite/Web%20Page.aspx
This has been confusing me for some time. With the advent of UTF-8 as the de facto standard in web development, I'm not sure in which situations I'm supposed to use HTML entities and in which I should just use the UTF-8 character directly. For example,
em dash (—, &mdash;)
ampersand (&, &amp;)
3/4 fraction (¾, &frac34;)
Please do shed light on this issue. It will be appreciated.
Based on the comments I have received, I looked into this a little further. It seems that currently the best practice is to forgo using HTML entities and use the actual UTF-8 character instead. The reasons listed are as follows:
UTF-8 encodings are easier to read and edit for those who understand what the character means and know how to type it.
UTF-8 encodings are just as unintelligible as HTML entity encodings for those who don't understand them, but they have the advantage of rendering as special characters rather than hard to understand decimal or hex encodings.
As long as your page's encoding is properly set to UTF-8, you should use the actual character instead of an HTML entity. I read several documents about this topic, but the most helpful were:
UTF-8: The Secret of Character Encoding
Wikipedia Special Characters Help
From the UTF-8: The Secret of Character Encoding article:
Wikipedia is a great case study for an application that originally used ISO-8859-1 but switched to UTF-8 when it became far too cumbersome to support foreign languages. Bots will now actually go through articles and convert character entities to their corresponding real characters for the sake of user-friendliness and searchability.
That article also gives a nice example involving Chinese encoding. Here is the abbreviated example for the sake of laziness:
UTF-8:
這兩個字是甚麼意思
HTML Entities:
&#36889;&#20841;&#20491;&#23383;&#26159;&#29978;&#40636;&#24847;&#24605;
The UTF-8 and HTML entity encodings are both meaningless to me, but at least the UTF-8 encoding is recognizable as a foreign language, and it will render properly in an edit box. The article goes on to say the following about the HTML entity-encoded version:
Extremely inconvenient for those of us who actually know what character entities are, totally unintelligible to poor users who don't! Even the slightly more user-friendly, "intelligible" character entities like &theta; will leave users who are uninterested in learning HTML scratching their heads. On the other hand, if they see θ in an edit box, they'll know that it's a special character, and treat it accordingly, even if they don't know how to write that character themselves.
As others have noted, you still have to use HTML entities for reserved XML characters (ampersand, less-than, greater-than).
You don't generally need to use HTML character entities if your editor supports Unicode. Entities can be useful when:
Your keyboard does not support the character you need to type. For example, many keyboards do not have em-dash or the copyright symbol.
Your editor does not support Unicode (very common some years ago, but probably not today).
You want to make it explicit in the source what is happening. For example, &nbsp; is clearer than the corresponding invisible white-space character.
You need to escape HTML special characters like <, &, or ".
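For that last point, a short sketch of what the escaping looks like in practice:

<!-- Displays the literal characters <b>, & and " instead of being parsed as markup -->
<p>Use &lt;b&gt; sparingly &amp; put attribute values in &quot;double quotes&quot;.</p>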
Entities may buy you some compatibility with brain-dead clients that don't understand encodings correctly. I don't believe that includes any current browsers, but you never know what other kinds of programs might be hitting you up.
More useful, though, is that HTML entities protect you from your own errors: if you misconfigure something on the server and you end up serving a page with an HTTP header that says it's ISO-8859-1 and a META tag that says it's UTF-8, at least your &mdash;es will always work.
I would not use raw UTF-8 characters for characters that are easily confused visually. For example, it is difficult to distinguish an em dash from a minus sign, or especially a non-breaking space from a regular space. For these characters, definitely use entities.
For characters that are easily understood visually (such as the Chinese examples above), go ahead and use UTF-8 if you like.
Personally, I have been doing everything in UTF-8 for a long time; however, in an HTML page you always need to convert ampersand (&), greater-than (>) and less-than (<) characters to their equivalent entities: &amp;, &gt; and &lt;.
Also, if you intend on doing some programming using UTF-8 text, there are a few things to watch for.
XML needs some extra lines to validate when using entities.
Some libraries do not play along nice with utf-8. For instance, PHP in some Linux distributions dropped full support for utf-8 in their regular expression libraries.
It is harder to limit the number of characters in a text that uses HTML entities, because a single entity uses many characters. Also, there's always the risk of cutting an entity in half.
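For example, a small JavaScript sketch of that length problem:

'…'.length
// 1: one character in the decoded text
'&hellip;'.length
// 8: eight characters in the entity-encoded markup
// Truncating the markup after 4 characters leaves "&hel", a broken entity.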
HTML entities are useful when you want to generate content that is going to be included (dynamically) into pages with (several) different encodings. For example, we have white label content that is included both into ISO-8859-1 and UTF-8 encoded web pages...
If character set conversion from/to UTF-8 wasn't such a big unreliable mess (you always stumble over some characters and some tools that don't convert properly), standardizing on UTF-8 would be the way to go.
If your pages are correctly encoded in UTF-8, you should have no need for HTML entities; just use the characters you want directly.
All of the previous answers make sense to me.
In addition: it mostly depends on the editor you intend to use and the document language. A minimum requirement for the editor is that it supports the document language. That means that if your text is in Japanese, beware of using an editor which cannot display it (i.e. don't use entities for the document text itself). If it's English, you can even use an old vim-like editor and use entities only for the relatively seldom-used © and friends.
Of course: &gt; for > and other HTML specials still need escapes.
But even with the other Latin-1 languages (German, French, etc.), writing &auml; all the time is a pain in you know where...
In addition, I personally write entities for invisible characters and for those which look similar to standard ASCII and are therefore easily confused. For example, there is U+1173 (which looks like a dash in some fonts) or U+1175, which looks like a vertical bar. I'd use entities for those in any case.