I'm trying to get special characters into HTML, and am not sure if this is even possible. If anyone remembers Kroz, or just about every DOS interface - there is a special set of shape characters. I want to use the single-line and double-line borders, shadows, and other shape characters, but I can't seem to track any of these down anywhere.
Also, will using these characters in an HTML environment present any localization concerns / will there be a required charset?
Thanks!
There is no “extended ASCII”; ASCII ends at code position 127 decimal, 7F hexadecimal. What is called “extended ASCII” is a set of mutually incompatible 8-bit encodings that contain the printable ASCII characters in the same positions as in ASCII. In your case, you seem to want to use the Code Page 437. All of its characters exist in Unicode. You can find the correspondence at
http://en.wikipedia.org/wiki/Code_page_437
which I believe to be correct in this issue; but the authoritative reference is
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT
There are various ways to enter the characters. You can use, say, “▓” as such in HTML, if you have some way of entering it and you use UTF-8 on the page. Alternatively, you can use a character reference like &#x2593;.
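For example, here is a minimal sketch of a UTF-8 page that draws a small double-line box with shading, first as literal characters and then as hexadecimal character references (the code points are from the Unicode Box Drawing and Block Elements blocks that CP437 maps to):

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>CP437 shapes</title>
</head>
<body>
<!-- Literal characters, saved as UTF-8: -->
<pre>
╔══╗  ░▒▓
║  ║
╚══╝
</pre>
<!-- The top row of the box again, as character references: -->
<pre>&#x2554;&#x2550;&#x2550;&#x2557;  &#x2591;&#x2592;&#x2593;</pre>
</body>
</html>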
Yes, these characters exist in Unicode (and can therefore be encoded as UTF-8). They are called box drawing characters.
See: http://www.fileformat.info/info/unicode/block/box_drawing/utf8test.htm
Related
Is there a good rule of thumb for when to use decimal vs. hexadecimal notation for HTML entities?
For example, a non-breaking hyphen is written in decimal as &#8209; and in hex as &#x2011;.
This answer says that hexadecimal is for Unicode; does that mean hex should be used if you're using the <meta charset="utf-8"> tag in the document <head>?
Occasionally, I will notice the raw entity text mistakenly rendered instead of the character it represents -- for example, &amp; appearing (instead of an ampersand) in an email subject line or RSS headline. Is either hex or decimal better for avoiding this?
One last consideration: can using hex or decimal affect the rendering clarity (crispness) of the character?
The rule of thumb is: use whichever you prefer, but prefer hex. ☺
There is no difference in meaning and no difference in browser support (the last browsers that supported decimal references only died in the 1990s).
As @AlexW describes, hexadecimal references are more natural than decimal, due to the way character code standards are written. But if you find decimal references more convenient, use them.
The issue has nothing to do with meta tags and character encodings. The main reason why character references were introduced into HTML is that they let you enter characters quite independently of the encoding of the document. This includes characters that cannot be directly written at all in the encoding used. Thanks to them, you can enter any Unicode character even if the character encoding is ASCII or some other limited encoding, like ISO-8859-1.
In the old days, it was common to recommend the use of named references (or “entity references”, as they are formally called in classic HTML) when possible, because a reference like &Omega;, when displayed literally to the user, is more understandable than a reference like &#937; or &#x3A9;. This hasn’t been relevant for over a decade, as far as web browsers are concerned. But e.g. e-mail clients might be kind of stupid^H^H^H^H^H^H^H^H^H underdeveloped in this respect. They might e.g. show references as such in a list of messages, even though they can interpret them properly when viewing a message. But there does not seem to be any consistent behavior that you could count on.
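For instance, all three of the following render as the same Ω in a browser; whether a given mail client shows the character or the raw reference is exactly the inconsistency described above:

<!-- Named (entity) reference, numeric decimal, numeric hexadecimal: -->
<p>&Omega; &#937; &#x3A9;</p>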
Overall
HTML (and XML) offers three ways to encode special characters: numeric hex (&#xHHHH;), numeric decimal (&#NNNN;) (together called "character references"), and named (&name;) (aka "entity references"). They've remained equally valid and fully supported by all major browsers for decades. They work with any encoding, but always render from the Unicode set (which is compatible with ASCII, ISO Latin, and Windows Latin, minus codes 128-159).
So it's up to personal preference, with a few things worth noting.
Necessity
If you add the proper charset meta tag to your HTML, you don't need to encode special characters at all (except the HTML-significant ones: & < > " '; or, more generally, just & and < in loose text). The exception is wanting to encode a character not present in the specified encoding. But if you use UTF-8, you can represent anything from Unicode anyway.
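As a sketch of what that means in practice (assuming the file really is saved as UTF-8):

<meta charset="utf-8">
<!-- Literal special characters need no references: -->
<p>Café, 25 °C, ©</p>
<!-- But & and < must still be escaped in text: -->
<p>a &lt; b &amp; c</p>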
Brevity
For any character below code point 10, decimal is shorter. A tab is &#9;, versus &#x9;, so it may be worth it for pre tags containing a lot of TSV data, for example.
Ease of Use
Named references are the easiest to use and memorize, especially for code shared among developers of different backgrounds and skill sets. &lt; is much more intuitive than &#x3C;. As for someone else's comment regarding relevance, they're actually still fully supported as part of the W3C standard, and have even been expanded on for HTML5.
Best Practice
Using named or decimal references may not be the best general practice, since the names are English-only and unique to HTML (even XML lacks named references, apart from the "big five": &amp; &lt; &gt; &quot; &apos;). Most programming languages and character tables use hex notation, so staying consistent with hex makes things easier and more portable in the long run. Though for small projects or special cases, it may not really matter.
More info: http://xmlnews.org/docs/xml-basics.html#references
These are called numeric character references. They are derived from SGML, and the numeric portion references the specific Unicode code point of the character you are trying to display. They allow you to represent any Unicode character, even if the particular character set you wrote the HTML in doesn't have the character you are referencing. Whether you reference the code point with decimal or hexadecimal does not matter, except for very old browsers that prefer decimal. Hexadecimal support was added because Unicode code points are conventionally written in hex notation, which makes it much easier to look up the code point and add the reference, without having to convert to decimal:
U+007D = &#x7D; = }
To answer your question:
This answer says that hexadecimal is for Unicode; does that mean hex should be used if you're using the <meta charset="utf-8"> tag in the document <head>?
You have to understand that UTF-8 is backwards-compatible with ASCII: the first 128 characters are encoded identically, and the first 256 Unicode code points match ISO-8859-1. Hex is just easier for UTF-8 because, as of 2013, there are 1,114,112 Unicode code points, all catalogued in hex notation. So it's easier to write &#x10FFFF; than it is to write &#1114111;, etc.
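For instance, using the highest code point just to contrast the two notations:

<!-- Hex mirrors the U+10FFFF notation directly: -->
&#x10FFFF;
<!-- Decimal requires converting 0x10FFFF to 1114111 first: -->
&#1114111;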
I want to insert the Arabic letters in the pattern just like the English letters
pattern="[a-zA-Z0-9-_. ]{1,30}"
I have no idea how to accomplish this.
The character range for Arabic and Persian is shared, so this class can be used for Arabic too.
[أ-يa-zA-Z]
This is the reference for finding the character range of Unicode languages:
preg_replace and preg_match arabic characters
http://unicode.org/charts/
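A minimal sketch combining that range with the letters, digits, and punctuation from the original pattern:

<input type="text" pattern="[أ-يa-zA-Z0-9-_. ]{1,30}">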
The HTML5 pattern attribute follows JavaScript regular expression syntax, which makes things rather awkward. You cannot test character properties, for example. Instead, you need to list down the allowed characters or ranges of characters.
Using the normative Scripts.txt file (by the Unicode Consortium), which defines the script (writing system) of all characters, I constructed the following:
pattern="[a-zA-Z0-9-_. \u0620-\u063F\u0641-\u064A\u066E-\u066F\u0671-\u06D3\u06D5\u06E5-\u06E6\u06EE-\u06EF\u06FA-\u06FC\u06FF\u0750-\u077F\u08A0\u08A2-\u08AC\uFB50-\uFBB1\uFBD3-\uFD3D\uFD50-\uFD8F\uFD92-\uFDC7\uFDF0-\uFDFB\uFE70-\uFE74\uFE76-\uFEFC]{1,30}"
(The attribute value must be one continuous string; stray line breaks or backslash continuations inside it would become part of the regular expression.)
Starting from the set of all characters with script defined to be Arabic, I picked up those that are declared as letters (General Category Lo or Lm), and then omitted those beyond BMP, the Basic Multilingual Plane.
Characters outside BMP are used very rarely, and to represent them in JavaScript syntax, you would need to either include the characters themselves or use two \u notations per character (one for each component of a surrogate pair). This does not sound realistic.
This is of course a “hardwired” solution: it may need updates if new Arabic letters are added to Unicode or the script of a character is changed from or to Arabic (which is highly unlikely). But I don’t expect to see new Arabic letters added to BMP during my lifetime.
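As a usage sketch, the same pattern inside a complete input element (the title attribute text here is just an illustrative validation hint, not part of the original answer):

<input type="text"
       title="Arabic or Latin letters, digits, -_. and space; 1 to 30 characters"
       pattern="[a-zA-Z0-9-_. \u0620-\u063F\u0641-\u064A\u066E-\u066F\u0671-\u06D3\u06D5\u06E5-\u06E6\u06EE-\u06EF\u06FA-\u06FC\u06FF\u0750-\u077F\u08A0\u08A2-\u08AC\uFB50-\uFBB1\uFBD3-\uFD3D\uFD50-\uFD8F\uFD92-\uFDC7\uFDF0-\uFDFB\uFE70-\uFE74\uFE76-\uFEFC]{1,30}">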
I need a character entity for this character: ▼
I searched a lot on Google but could not find one. Any help is appreciated.
It's called a large down arrow, formally BLACK DOWN-POINTING TRIANGLE: &#9660; or &#x25BC; (▼)
If you know how to type the character itself, or you have it in a file, then you can find out its Unicode number, e.g., by opening the file in Word, placing the cursor after the character, and pressing Alt+X. This changes the character to its number. Or you can open the file in the BabelPad editor and move the cursor before the character; BabelPad displays the number in the bottom line of its window.
If you know a character from printed matter only or otherwise need to recognize a character from its graphic shape, you can use http://shapecatcher.com/ (a bit clumsy for a “filled” character like this, but easy for more normal characters).
Once you know the Unicode number, 25BC in this case, you can construct a character reference: &#x25BC;. Should you prefer decimal numbers, you can use a calculator and then use a reference like &#9660;. But hex numbers are generally better for readability of code, since character numbers are conventionally written in hex. (The Unicode name of this character is BLACK DOWN-POINTING TRIANGLE.)
There is no entity defined for this character in HTML 4.01 entities or even in the proposed HTML5 entities (called “named character references” there). But an entity would not be particularly useful; entity names are not very mnemonic, and (numeric) character references can be used for any character.
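So, as a quick sketch, any of the following lines produces the same glyph:

<span>&#x25BC;</span> <!-- hexadecimal character reference -->
<span>&#9660;</span>  <!-- decimal character reference -->
<span>▼</span>        <!-- literal character, on a UTF-8 page -->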
Currently, I have my webpage set to Unicode/UTF-8. When trying to display a special character (for example, em dash, double arrow, etc), it shows up as a question mark symbol. I cannot change these characters to the HTML entity equivalent. How can I circumvent this issue?
A question mark in a lozenge, �, indicates a character-level error: the data contains bytes that do not represent any character, according to the character encoding being applied. This typically happens when the document is declared as UTF-8 encoded but is really in ISO-8859-1, windows-1252, or some similar encoding. Windows-1252 is a common default encoding used by various programs on Windows platforms. So you may need to open the file in your authoring program and re-save it as UTF-8 encoded.
If problems remain, please post the URL. Posting the code alone is not sufficient, since the character encoding is primarily specified in HTTP headers.
If you see a question mark in a small box, then it might be a font-level problem (lack of glyph in the fonts being used), but this would be very rare for common characters like the em dash. Different browsers have different ways of indicating character- or font-level problems.
Make sure your document is set to the correct character encoding in the actual code editor, as well as in the doctype. Both are necessary. I spent hours trying to tweak HTML when the only problem was that I needed to set the text setting in Coda.
<head>
<meta charset="utf-8">
...
</head>
Make sure your characters are actually valid Unicode characters. In a character table they are listed with code points such as U+00AE (®) or U+0020 (space).
http://www.kinsmancreative.com/transfer/char/index.php is a handy site for finding the decimal values of commonly used UTF-8 special characters if you need a reference.
When I copy/paste text from most sites and PDFs, the following characters almost always come through as their Unicode equivalents:
double quote: " is &ldquo; and &rdquo; (“ and ”)
single quote: ' is &lsquo; and &rsquo; (‘ and ’)
ellipsis: ... is &hellip; (…)
I understand the ones that can't be represented without Unicode, like © and ¢, but even for those, I wonder.
When should you use these Unicode equivalents? Are they more semantic than not using them? Are they better interpreted by devices (copy/paste/print)? I always find it annoying getting those quote and ellipsis characters because with TextMate + programming, you don't use them.
When should you use these Unicode equivalents? Are they more semantic than not using them?
Note that these are not “Unicode equivalents”. Those characters are available in many character sets other than Unicode, and they are strictly distinct from the alternatives that you propose.
In typography, the left and right versions of the single and double quotation marks are correct. They provide the traditional appearance for those characters that has been used in print media for many years. The ellipsis character provides the correct spacing for an ellipsis that does not naturally occur when using consecutive full stop characters. So the reason all of these are used is to make the text appear correctly to human readers.
Are they better interpreted by devices (copy/paste/print)?
Any system that uses any character set should be designed to correctly handle that character set. If the text is encoded in Unicode, then any recent system (from the last 15 years at least) should be able to handle it, since Unicode is the de facto standard character set for all modern systems.
Not all Unicode-conformant systems will be able to display all characters correctly. This will depend on the fonts available, and even the rendering system that uses the fonts. But any Unicode-conformant system will be able to transmit the characters unaltered (such as in a copy and paste operation).
I always find it annoying getting those quote and ellipsis characters because with TextMate + programming, you don't use them.
It is unusual to copy English (or whatever language) text directly into a program without having to add separate delimiters to that text. But most modern programming languages will not have any difficulty handling the text once it is properly delimited.
Any systems that cannot handle Unicode correctly should be updated. Legacy character encodings will have no place in the future.
I think there's a simple explanation: MS Word converts these characters/sequences automatically as you type, and a lot of text on the internet has been copied from that editor.
Most of the articles I get for my site from other authors are sent as .doc files and I have to convert them. Usually they contain the characters you've mentioned.
I'd also add one more: the many different types of dashes used instead of the hyphen. And also the low opening double quote (as seen in some European languages).
I usually let them stay in the text (all my pages are Unicode). It's just important to remember them when playing around with regex etc. (the dashes especially can be tricky and hard to spot).
HTML entities serve a triple purpose:
Being able to use characters that do not belong to the document character set, e.g., inserting a euro symbol in an ISO-8859-1 document.
Escaping characters that have a special meaning in HTML, such as angle brackets.
Making it easier to type characters that are not on your keyboard or are not supported by your editor, e.g., a copyright symbol.
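One small illustration of each purpose (the surrounding markup is just scaffolding):

<!-- 1. A euro sign in a page declared as ISO-8859-1, which lacks the euro: -->
<p>Price: 10 &euro; (or equivalently &#x20AC;)</p>
<!-- 2. Escaping characters with special meaning in HTML: -->
<p>use &lt;br&gt; sparingly</p>
<!-- 3. A symbol that is not on most keyboards: -->
<p>&copy; Example Corp</p>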
Update:
My info is correct but I suspect I've answered the wrong question...
On the web, I would consider that markup adds semantic meaning, content does not. So it doesn't really matter which you use in this context.
Typographers would insist on “ and ”, whereas programmers don't care and just use regular old straight quotes: ".
The key here is interoperability. There are different encoding schemes. As we've all been victim to, people paste content into an editor from Word, which uses windows-1252 encoding. When you serve this content up via AJAX, it usually breaks, because AJAX uses UTF-8 encoding by default.
Office 2010 now allows saving documents in UTF-8 format. Also, databases use different Unicode encoding schemes. The best bet is to use UTF-8 end-to-end.
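A sketch of the page side of that, assuming the server sends a matching Content-Type header:

<!-- The HTTP response should include:
     Content-Type: text/html; charset=utf-8 -->
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
</head>
<body>“Smart quotes”, dashes, and … survive end-to-end.</body>
</html>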
When you copy-paste text that includes special characters, they will be left as they are. This is perfectly fine if the characters match the charset used by the webpage.
HTML entities are just a convenience for producing specific characters in any character set. Keyboards tend not to have keys to get symbols like ©, so the HTML entity is a shortcut.
I'm going to generalize and say that most of the time the content is UTF-8 (please correct me if I'm wrong), and the copied characters usually come through correctly, so everything works great. If they aren't copied correctly, or the charset is subject to change, or you're after i18n support, go with the HTML or XML entities. Otherwise, leave them as they are; the browser will display them just fine.