I need to put a link with this href="file://attachments/aaaa_#_aaaa.msg"
Obviously this doesn't work, because the hash character # is used for anchors.
So I tried changing it to: href="file://attachments/aaaa_%23_aaaa.msg"
but when I open the URL in IE, the browser tries to open this instead: href="file://attachments/aaaa_%2523_aaaa.msg"
IE is encoding the % character as %25.
How can I encode the hash character # in the file name so that the link works and the file downloads in all browsers?
I can't change the file name to remove this character, so I need a way to deal with this problem.
You will avoid lots and lots and lots of pain if you are able to rename your files so they don't contain a "#" character. As long as they do, you will probably have current and future cross-browser issues, confusion on behalf of future developers working on your code (or confusion on your behalf in the future, when you've forgotten the ins and outs of the encoding), etc. Also, "#" has a special meaning in some shells and tools, which can make such filenames awkward to handle. Not sure what OS you're using, but your filenames should be as portable as possible across OSs, even if you're "sure" right now that you'll never be running on one of those systems.
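To see where the %2523 in the question comes from, here is a small Python sketch (Python and the variable names are my own choice for illustration, not something from the question). Encoding the file name once turns # into %23; encoding the already-encoded name a second time turns the % into %25, which matches the double encoding IE appears to apply.

```python
from urllib.parse import quote, unquote

file_name = "aaaa_#_aaaa.msg"

# First pass: "#" is not allowed unescaped in a path, so it becomes %23.
encoded_once = quote(file_name)

# Second pass: the "%" of "%23" is itself escaped to "%25".
encoded_twice = quote(encoded_once)

print(encoded_once)           # aaaa_%23_aaaa.msg
print(encoded_twice)          # aaaa_%2523_aaaa.msg
print(unquote(encoded_once))  # aaaa_#_aaaa.msg
```

This only reproduces the symptom; it does not by itself make IE stop re-encoding file:// links.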
Is there any particular reason I should use HTML symbol entities instead of the actual symbol (I mean the one which I can just type)? For example, take the symbol /; the HTML entity code for it is &#47;.
Should I use the symbol's code or the symbol itself in my HTML code, and why?
Using an HTML entity reference allows the entity to be represented as intended regardless of the encoding applied to the document. That is the benefit.
Rather than strictly using entities for all non-US-ASCII characters, feel free to use an encoding for your document that supports the document's target language, preferably one also supporting other languages, like UTF-8.
However, please avoid using any system-specific encoding, especially regular Windows encoding. It is often the case that Windows-1252 text is sent to other systems with the wrong label of ISO-8859-1.
In the past there has certainly been less reliable support for numeric HTML entities than for named HTML entities (based on my own first-hand observation), but in theory a numeric HTML entity is still independent of character encoding and "safe", because the numeric value refers directly to a code point registered in the UCS (http://en.wikipedia.org/wiki/Universal_Character_Set) and is equivalent to its defined character name.
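As a quick illustration of the point that a numeric entity names a code point directly, here is a minimal Python sketch. The choice of the right single quotation mark as the sample character is mine; any character with a named entity would do.

```python
import html

# Named, decimal and hexadecimal references for the same character.
named = html.unescape("&rsquo;")
decimal = html.unescape("&#8217;")
hexadecimal = html.unescape("&#x2019;")

# All three resolve to the same code point, U+2019.
print(named == decimal == hexadecimal)  # True
print(f"U+{ord(decimal):04X}")          # U+2019
```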
Caveat: the following describes my own experience, and yours may vary.
HTML documents transferred by clients for me to work on with symbols directly embedded are very often corrupted and cannot be recovered. This may be a weakness of U.S. infrastructure or a lack of knowledge on the part of my customers about how to send their documents. The infrastructure and people in a country whose primary language relies on non-ASCII characters would be much more likely to support and understand how to properly transfer their documents with no corruption.
If you are developing your own website and uploading the final copies of your own files to your server, then the risk of corruption is very small.
If you do not have control over your document from the point you edit it to the point that it is served to users, then you run the risk of having the document improperly converted at some point along the way and permanently corrupted, regardless of what encoding you attempt to view it in. In my experience in the U.S. in recent years, this has been a likelihood more than a mere risk.
No.
Entities and character references are useful only if:
The character has special meaning in HTML at the point where you want to use the character (/ never will, it only has special meaning in places where you can't have a / as data anyway).
You can't type the character (e.g. because it doesn't appear on your keyboard).
You can't encode the file as UTF-8 (or in another encoding that includes it … and / appears in ASCII).
Unless you know for a fact that you will always be using the same software and computer system to edit your HTML, you will inevitably run into situations where you cannot edit your own code if you directly use symbols, regardless of what character encoding you specify in your document or with your HTTP headers. Only in a perfect world does the character encoding always properly transfer, and even then neither Macintosh nor Windows truly does it correctly.
If I open up a supposedly "properly" encoded document from either Macintosh or Windows in software that truly supports all available encoding systems, I see a message like this:
-=-J(DOS)**--F1 Top L3 (Text) ----------------------------------------
These default coding systems were tried to encode text
in the buffer:
(iso-2022-7bit-dos (284 . 4194194) (379 . 4194194) (462 . 4194195)
(492 . 4194196) (635 . 4194195) (640 . 4194196) (642 . 4194195) (772
. 4194196) (833 . 4194195) (839 . 4194196) (857 . 4194195))
(utf-8-dos (284 . 4194194) (379 . 4194194) (462 . 4194195) (492
. 4194196) (635 . 4194195) (640 . 4194196) (642 . 4194195) (772
. 4194196) (833 . 4194195) (839 . 4194196) (857 . 4194195))
However, each of them encountered characters it couldn't encode:
iso-2022-7bit-dos cannot encode these: \222 \222 \223 \224 \223 \224 \223 \224 \223 \224 ...
utf-8-dos cannot encode these: \222 \222 \223 \224 \223 \224 \223 \224 \223 \224 ...
Click on a character (or switch to this window by `C-x o'
and select the characters by RET) to jump to the place it appears,
where `C-u C-x =' will give information about it.
Select one of the safe coding systems listed below,
or cancel the writing with C-g and edit the buffer
to remove or modify the problematic characters,
or specify any other coding system (and risk losing
the problematic characters).
thai-tis620
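The \222, \223 and \224 in that message are octal byte values. My reading (an assumption about this particular file, not something the message states) is that they are Windows-1252 curly quotes that were never converted. A short Python sketch shows what those bytes mean under two single-byte encodings:

```python
# The three problem bytes, written in octal as in the editor message.
raw = bytes([0o222, 0o223, 0o224])

as_cp1252 = raw.decode("cp1252")    # Windows-1252 interpretation
as_latin1 = raw.decode("latin-1")   # ISO-8859-1 interpretation

print([hex(b) for b in raw])                    # ['0x92', '0x93', '0x94']
print([f"U+{ord(c):04X}" for c in as_cp1252])   # ['U+2019', 'U+201C', 'U+201D']
print([f"U+{ord(c):04X}" for c in as_latin1])   # ['U+0092', 'U+0093', 'U+0094']
```

Under Windows-1252 the bytes are the right single quote and the two double curly quotes; under ISO-8859-1 they are C1 control characters, which would explain why the text shows up as unprintable.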
Remember that as soon as the data is off of your server, e.g., placed in an email, etc., there is no guarantee the encoding is passed along, and chances are that it is not. Byte marks and other invisible means of identifying documents do not work as promised, let alone transient methods such as HTTP headers which are lost as soon as the document moves beyond the context of your own carefully configured HTTP server.
The guiding principle of HTML is that it is a plain text markup language that, when properly used, is universally compatible with any system supporting the most basic of text. HTML documents should use HTML entities for any characters outside of the normal 7-bit US-ASCII character set. Any other characters have different binary definitions depending on the encoding used and may even vary between single-byte and multi-byte representations.
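If you want to follow that principle mechanically, one possible approach (a sketch in Python, with a sample string I made up) is to let the encoder replace everything outside US-ASCII with numeric character references:

```python
import html

text = "It\u2019s a caf\u00e9"  # contains a curly apostrophe and an e-acute

# Encode to pure ASCII, replacing other characters with &#NNNN; references.
ascii_html = text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

print(ascii_html)                         # It&#8217;s a caf&#233;
print(html.unescape(ascii_html) == text)  # True
```

Note that this does not escape characters that are special in HTML itself, such as < and &; that is a separate step.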
Within Non-HTML documents you can feel free to use raw symbols because when you embed them within either their native file format or within HTML you can ensure that you specify the "right" character encoding, i.e., the one that will be recognized by the system where you authored it and any system compatible with that.
I noticed that when compiling a CHM file, if the tree contains any article with a double quote in its title, such as How to use the "?" correctly, the tree is not processed properly. In the actual help file, the title reads How to use the.
Is there a problem with Windows Help Files, in its ability to process quotes? Or, do I have to specify character encoding somewhere to get around this issue?
Thank you.
As far as I know, topic titles are wchars in the index, and the rest is simply HTML. So this is probably a limitation of the CHM compiler and not of the format.
Unfortunately, there is not much that you can do about that.
On the other hand, HTML (as far as I know) requires " to be replaced by &quot; in such places. Did you try that?
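If the titles are generated by a script, the replacement can be automated. Here is a minimal Python sketch using the standard library; whether the CHM compiler then accepts the escaped title is something I have not tested.

```python
import html

title = 'How to use the "?" correctly'

# quote=True also escapes double (and single) quotes, not just <, > and &.
escaped = html.escape(title, quote=True)

print(escaped)  # How to use the &quot;?&quot; correctly
```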
In the article Better web typography in a few simple steps, it says
Talking about apostrophes, the correct sign for them is the right single quotation mark. A dead give-away for amateur typography is the presence of straight quotation marks, also called 'dumb quotes' by type-savvy designers.
I've been using these "dumb quotes" all along!
Now, when one is writing regular HTML (and not Markdown, which automatically produces apostrophes), how is one supposed to sanely write correct apostrophes? Am I just supposed to inject ’ wherever a ' would go before? Is there a program that automatically does this?
How do professional web designers take care of this problem?
You have a couple of options here:
As was pointed out before, use either numerical or named HTML entities.
Write your HTML with straight apostrophes and then do a search and replace before publishing. This is workable, but could lead to unexpected replacements if you aren't careful.
Insert the actual single quote using the appropriate keyboard sequence for your operating system (option-shift-] on a Mac or alt-0146 on a PC) and make sure to save and serve your HTML as UTF-8. That way you don't have to screw around with entity names, but it assumes a UTF-8 clean workflow.
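For the search-and-replace option, a cautious rule helps avoid the unexpected replacements mentioned above. The following Python sketch (the function name is hypothetical) only converts a straight apostrophe that sits between two word characters, so quoted attribute values such as class='x' are left alone. It is deliberately conservative and will miss cases like a trailing possessive (dogs').

```python
import re

# A straight apostrophe with a word character on both sides, as in "don't".
APOSTROPHE = re.compile(r"(?<=\w)'(?=\w)")

def curl_apostrophes(text: str) -> str:
    """Replace in-word straight apostrophes with U+2019."""
    return APOSTROPHE.sub("\u2019", text)

print(curl_apostrophes("<a class='x'>it's here</a>"))
```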
We are currently working on an I18N project. I am wondering what the complications of having non-ASCII characters in the URL are. If it's not advisable, what are the alternatives for dealing with this problem?
EDIT (in response to Maxym's answer):
The site is going to be local to a specific country, and I need not worry about the worldwide public accessing it. I understand that from a usability point of view it is really annoying. What other technical problems are associated with this?
It is possible to use non-ASCII/non-Latin domain names using IDNA. Further, you can always use percent encoding (like %20 for space) in URLs. RFC 3986 recommends UTF-8 encoding combined with percents:
the data should first be encoded as
octets according to the UTF-8
character encoding; then only those
octets that do not correspond to
characters in the unreserved set
should be percent-encoded. (...) For
example, the character A would be
represented as "A", the character
LATIN CAPITAL LETTER A WITH GRAVE
would be represented as "%C3%80", and
the character KATAKANA LETTER A would
be represented as "%E3%82%A2".
Modern clients (web browsers) are able to transform back and forth between percent encoding and Unicode, so the URL is transferred as ASCII but looks pretty for the user.
Make sure you're using a web framework/CMS that understands this encoding as well, to simplify URL input from webmasters/content editors.
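The examples from the RFC quote can be checked with a few lines of Python (my choice of language; the host name below is only an illustration). The path part uses UTF-8 plus percent-encoding, while the host part uses IDNA:

```python
from urllib.parse import quote, unquote

# Path segments: UTF-8 octets, then percent-encode everything not unreserved.
samples = ["A", "\u00c0", "\u30a2"]  # A, A WITH GRAVE, KATAKANA LETTER A
encoded = [quote(ch, safe="") for ch in samples]
print(encoded)  # ['A', '%C3%80', '%E3%82%A2']

# Decoding reverses the process.
print([unquote(e) for e in encoded] == samples)  # True

# Host names: converted to an ASCII "xn--" form via IDNA instead.
host = "n\u00fcrnberg.de"
ascii_host = host.encode("idna").decode("ascii")
print(ascii_host)
```

Python's built-in idna codec implements the older IDNA 2003 rules, which is enough for a demonstration but may differ from what current browsers do for some characters.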
I would say no. The reason is simple: if you rely on the worldwide public, it would be a big problem for people to type your URL. I live in the "Cyrillic" world, where it is possible to create Cyrillic URLs, but no one has succeeded with that, because even we are too lazy to switch keyboard languages and have got used to typing Latin characters.
Update:
I can't say much about alternatives, but some languages have an informal or formal letter substitution; e.g. in German you can write Ö, but in a URL you might see OE instead. You can also consider English words, or words that sound similar (so people from your country can remember the spelling, and it does no harm for people from other countries).
It depends on the target users. For example, Nürnberg.de is also reachable as nuernberg.de to keep it easily accessible. For native German users the umlaut form is no obstacle, as the German keyboard layout is the default and makes all four extra characters (ö, ä, ü, ß) available. Do not forget that one of the goals of I18N is to give the end user a native-language feel. Mac and Linux users have even more intuitive input methods; on a Mac, for example, pressing Alt+u adds an umlaut to the next character.
I was just wondering what are the
complications of having the non-ascii
characters in the URL.
But the way you phrased your question, it seems to be more about URIs than URLs: you are trying to put non-ASCII characters into the name part of the URI. There are no complications with that, as long as you know where and how to parse that part on the server (for example, on a Django-based server it can be parsed and handled using regular expressions in urls.py). All you need to keep in mind is that with the Web 2.0 (Ajax, JavaScript-based) evolution, nearly everything runs on UTF-8; JavaScript's encodeURIComponent, for instance, always percent-encodes text as UTF-8. UTF-8 has thus evolved into a sort of standard. Stick with UTF-8, and you will hardly face any complications in URI parsing.
For example, check the URIs http://de.wikipedia.org/wiki/Fürth or http://hi.wikipedia.org/wiki/जर्मनी. Irrespective of the encoding in which you type them into the address bar, the browser will translate them to UTF-8 and send them to the server.
NOTE: besides UTF-8, some symbols are encoded using percent-encoding. More about it can be found here:
http://en.wikipedia.org/wiki/Percent-encoding
You can use non-ASCII characters in a URL, but it's ugly because special characters must be encoded like this:
http://www.w3schools.com/tags/ref_urlencode.asp
When I save a file with an .htm or .html extension, which one is correct and what is different?
Neither is wrong, it's a matter of preference. Traditionally, MS software uses htm by default, and *nix prefers html.
As oded pointed out below, the .htm tradition was carried over from Windows 3.x, where file extensions were limited to three characters.
Mainly, the number of characters is different.
".htm" smells of Microsoft operating systems where the file system historically limited file name extensions (the part of the file name after the dot) to 3 characters.
".html" smells of Un*x operating systems that did not have this limitation and that were used for all the serious internet work at the time.
Pragmatically, the two are equivalent.
The difference is cultural. ".html" is regarded by some as more correct. The same people tend to look down on Microsoft operating systems and regard ".htm" as an unsightly reminder of their limitations.
When you save the file locally, the difference doesn't matter: your local system will likely treat the two file extensions as interchangeable for loading by your browser. The reason both exist is that historically Windows-based systems used 3-letter extensions (htm) and Unix-based systems used the 4-letter form (html).
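One way to check the "interchangeable" claim on a given machine is to ask a MIME type table what it makes of each extension. A minimal Python sketch using the standard library's table (the file names are made up):

```python
import mimetypes

# Look up the media type each extension maps to.
htm_type = mimetypes.guess_type("page.htm")[0]
html_type = mimetypes.guess_type("page.html")[0]

print(htm_type, html_type)
```

Both lookups should report the same HTML media type, which is why browsers and most tools treat the two extensions alike.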
On a server-side, there may be some differences when it comes to serving default filenames:
The one situation in which there may be a difference between the two extensions is that of a server's default filenames. When a URL that does not specify a filename is requested from a server, such as http://www.domain.dom/dirname/, the server returns a file from the requested URL that matches a default filename. Examples of common default filenames include "index.html," "index.htm," "default.html," "default.htm," etc. However, an administrator can make the server's default filename anything he/she so desires.
Note that servers are often configured with more than one default filename.
So if you have any level of control over your server's default filenames, then this shouldn't be an issue.
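To make the default-filename behaviour concrete, here is a rough Python sketch of the lookup a server might perform. The function name and the order of the defaults are assumptions for illustration; real servers take this list from their own configuration.

```python
from typing import Iterable, Optional, Sequence

def pick_default_file(files_in_dir: Iterable[str],
                      defaults: Sequence[str]) -> Optional[str]:
    """Return the first configured default name that exists in the directory."""
    available = set(files_in_dir)
    for name in defaults:
        if name in available:
            return name
    return None  # no default file: the server would list the directory or fail

defaults = ["index.html", "index.htm", "default.html", "default.htm"]
print(pick_default_file(["about.html", "index.htm"], defaults))  # index.htm
```

With only "index.html" configured as a default, a directory containing just index.htm would not be served, which is the one case where the extension matters.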
Personally I prefer .html, but as others have said, both will work.
Just make sure you only use one. Never both on the same site!
link to mypage.html is not the same as link to mypage.htm
Also notice that as part of a URI, the file extension doesn't play any role. In fact, it isn't even a file extension, it just looks like one. The type of the resource identified by a URI is not encoded in its name. Instead, it is decided by the Content-Type HTTP header field. It's completely legitimate (but perhaps a bit stupid) to deliver a bitmap picture as myimage.html and conversely, to deliver an HTML page as index.png. This is also the reason why it is argued that file extensions shouldn't be part of URIs at all.
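The point about Content-Type can be tried out with a throwaway local server. The sketch below is Python standard library only; the handler and helper names are hypothetical. It answers every path, including one ending in .png, with an HTML body labelled text/html, and then reads back the header the client sees.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlwaysHtml(BaseHTTPRequestHandler):
    """Serve the same HTML page for every path, whatever its 'extension'."""

    def do_GET(self):
        body = b"<!DOCTYPE html><title>Hi</title><p>Served as HTML.</p>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet

def fetch_content_type(path: str) -> str:
    # Port 0 lets the OS pick a free port on the loopback interface.
    server = HTTPServer(("127.0.0.1", 0), AlwaysHtml)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()
        content_type = response.getheader("Content-Type")
        conn.close()
        return content_type
    finally:
        server.shutdown()
        server.server_close()

print(fetch_content_type("/index.png"))  # text/html; charset=utf-8
```

A browser pointed at such a URL renders the page as HTML, because it goes by the header and not by the ".png" in the path.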
Sir Tim Berners-Lee elaborates on this in Hypertext Style: Cool URIs Don't Change.
They are completely interchangeable. If I understand the history properly, in the beginning the correct extension was .html, but older Microsoft systems such as DOS and Windows 3.x could only cope with 3-character extensions.
So .html is correct according to some standard or other, but in practice it doesn't matter most of the time. (I have just done a quick Google search and found the following.)
There is one area of concern though, most host servers will require your default starting page to be named as "index.html" and not as "index.htm"
I use .htm. Less typing, I guess. Or perhaps it's my Windows bias.
Both are correct. In the past, file extensions on some systems had to be a maximum of 3 characters long.
http://en.wikipedia.org/wiki/Filename_extension
Personally I prefer .html, since the name is "Hypertext Markup Language". .htm was used because certain legacy versions of Windows could not have more than 3 characters in the file name extension.
Both work the same way, but for technical and non-technical background, see here:
http://www.sightspecific.com/~mosh/www_faq/ext.html