Many posts and pages on social media use uncommon stylised lettering for humorous effect or simply to be eye-catching, such as:
<div id="..." class="...">
<p>LIKE THE 🅱ACKUP JUST INCASE WE GET PERMANENTLY ZUCCED </p>
<p>
and this:
<a>Ｖａｐｏｒｗａｖｅ</a>
What exactly are the "🅱" and "Ｖ" type characters and what are their purposes? Wouldn't using normal text characters make these redundant?
Edit:
adding an image of the code for those who may not be able to see the characters:
🅱 is Unicode code point U+1F171 NEGATIVE SQUARED LATIN CAPITAL LETTER B. It is defined in the Enclosed Alphanumeric Supplement block of the Unicode standard.
From this Wikipedia explanation:
Enclosed alphanumerics is a Unicode block of typographical symbols of an alphanumeric within a circle, a bracket or other not-closed enclosure, or ending in a full stop. There is another block for these characters (U+1F100—U+1F1FF), encoded in the Supplementary Multilingual Plane, which contains the set of Regional Indicator Symbols as of Unicode 6.0.
Purpose
Many of these characters were originally intended for use as bullets for lists.[3] The parenthesized forms are historically based on typewriter approximations of the circled versions.[3] Although these roles have been supplanted by styles and other markup in "rich text" contexts, the characters are included in the Unicode standard "for interoperability with the legacy East Asian character sets and for the occasional text context where such symbols otherwise occur."[3] The Unicode Standard considers these characters to be distinct from characters which are similar in form but specialized in purpose, such as the circled C, P or R characters which are defined as copyright and trademark symbols or the circled a used for an at sign.[3]
Ｖ is Unicode code point U+FF36 FULLWIDTH LATIN CAPITAL LETTER V. It is defined in the Halfwidth and Fullwidth Forms block of the Unicode standard.
From this Wikipedia explanation:
In CJK (Chinese, Japanese and Korean) computing, graphic characters are traditionally classed into fullwidth (in Taiwan and Hong Kong: 全形; in CJK and Japanese: 全角) and halfwidth (in Taiwan and Hong Kong: 半形; in CJK and Japanese: 半角) characters. With fixed-width fonts, a halfwidth character occupies half the width of a fullwidth character, hence the name.
In the days of computer terminals and text mode computing, characters were normally laid out in a grid, often 80 columns by 24 or 25 lines. Each character was displayed as a small dot matrix, often about 8 pixels wide, and an SBCS (single byte character set) was generally used to encode characters of western languages.
For a number of practical and aesthetic reasons, Han characters would need to be twice as wide as these fixed-width SBCS characters. These "fullwidth characters" were typically encoded in a DBCS (double byte character set), although less common systems used other variable-width character sets that used more bytes per character.
Halfwidth and Fullwidth Forms is also the name of a Unicode block U+FF00–FFEF.
In Unicode
In Unicode, if a certain grapheme can be represented as either a fullwidth character or a halfwidth character, it is said to have both a fullwidth form and a halfwidth form.
Halfwidth and Fullwidth Forms is the name of Unicode block U+FF00–FFEF, the last of the Basic Multilingual Plane excepting the short Specials block at U+FFF0–FFFF.
Range U+FF01–FF5E reproduces the characters of ASCII 21 to 7E as fullwidth forms, that is, a fixed width form used in CJK computing. This is useful for typesetting Latin characters in a CJK environment. U+FF00 does not correspond to a fullwidth ASCII 20 (space character), since that role is already fulfilled by U+3000 "ideographic space."
Range U+FF65–FFDC encodes halfwidth forms of Katakana and Hangul characters – see half-width kana. Range U+FFE0–FFEE includes fullwidth and halfwidth symbols.
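To check for yourself what a stylised character is, a short Python sketch using the standard `unicodedata` module can help. The helper names (`describe`, `to_fullwidth`, `to_plain`) are made up for this illustration; the offset relies on the fact quoted above that U+FF01–FF5E mirrors ASCII 21 to 7E, with U+3000 standing in for the space.

```python
import unicodedata

# Distance between ASCII 0x21-0x7E and the fullwidth range U+FF01-U+FF5E.
FULLWIDTH_OFFSET = 0xFF01 - 0x21  # 0xFEE0


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def to_fullwidth(text: str) -> str:
    """Map printable ASCII to fullwidth forms; the space becomes U+3000."""
    out = []
    for ch in text:
        if ch == " ":
            out.append("\u3000")  # ideographic space
        elif 0x21 <= ord(ch) <= 0x7E:
            out.append(chr(ord(ch) + FULLWIDTH_OFFSET))
        else:
            out.append(ch)  # leave everything else untouched
    return "".join(out)


def to_plain(text: str) -> str:
    """Fold fullwidth forms back to ASCII using NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


print(describe("\U0001F171"))
print(describe("\uFF36"))
print(to_fullwidth("Vaporwave"))
print(to_plain(to_fullwidth("Vaporwave")))
```

NFKC normalization folds many other compatibility characters as well, so it is a blunt tool if you only want to undo fullwidth styling.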
Related
I'm searching for a list of exponents like ¹²³ and so on, and the same for letters. Note that these remain superscripted even in plain text.
Does something like these exist? If not, how can I create those?
(I need them for a website-project)
Unicode versions of superscripted/subscripted characters exist for all ten digits but not for all letters. They remain superscripted/subscripted in a plain-text environment without the need of format tags such as <sup>/<sub>.
However (as of v14), not all letters have Unicode superscripts. Furthermore, they are scattered along different Unicode ranges, and are in fact used mainly for phonetic transcription. Additionally, they are used for compatibility purposes especially if the text does not support markup superscripts and subscripts.
Exponent characters:
These are mostly used for mathematical and referencing usage.
- ⁰ [U+2070]
- ¹ [U+00B9, Latin-1 Supplement]
- ² [U+00B2, Latin-1 Supplement]
- ³ [U+00B3, Latin-1 Supplement]
- ⁴ [U+2074]
- ⁵ [U+2075]
- ⁶ [U+2076]
- ⁷ [U+2077]
- ⁸ [U+2078]
- ⁹ [U+2079]
- ⁺ [U+207A]
- ⁻ [U+207B]
- ⁼ [U+207C]
- ⁽ [U+207D]
- ⁾ [U+207E]
- ⁿ [U+207F]
- ⁱ [U+2071]
The "linear", "squared", and "cubed" superscripts (¹, ², ³) are the most familiar and are found in Latin-1 Supplement. All the others are found in Superscripts and Subscripts. For the digits outside Latin-1 Supplement (0 and 4 to 9), add the digit's value to 0x2070 to obtain its code point. See this Wikipedia article and the official Unicode codepage segment.
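As a rough illustration of the code point layout listed above, here is a small Python sketch that builds a translation table for the superscript characters. The names `SUPERSCRIPT_MAP` and `to_superscript` are invented for this example, and only the characters from the list are covered.

```python
import unicodedata

# Digits 0 and 4-9 sit at U+2070 + digit; 1, 2 and 3 are the Latin-1 exceptions.
SUPERSCRIPT_MAP = {str(d): chr(0x2070 + d) for d in (0, 4, 5, 6, 7, 8, 9)}
SUPERSCRIPT_MAP.update({"1": "\u00B9", "2": "\u00B2", "3": "\u00B3"})
# Signs, parentheses and the two letters from the list above.
SUPERSCRIPT_MAP.update({
    "+": "\u207A", "-": "\u207B", "=": "\u207C",
    "(": "\u207D", ")": "\u207E", "n": "\u207F", "i": "\u2071",
})

_TABLE = str.maketrans(SUPERSCRIPT_MAP)


def to_superscript(text: str) -> str:
    """Replace every character that has a listed superscript form."""
    return text.translate(_TABLE)


print(to_superscript("2n+1"))
print(unicodedata.name(to_superscript("4")))
```

Characters without an entry in the table pass through unchanged, which is why this only suits short exponents rather than arbitrary text.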
Interesting notes
There are also subtle differences between <sup> superscripts and Unicode superscripts; Unicode superscripts are entirely different code points, and some fonts include professionally designed superscript glyphs because scaled-down <sup> superscripts may look thin.
Compare x² with x<sup>2</sup>, and similarly x⁺ with x<sup>+</sup> (the first of each pair uses a Unicode character, the second uses markup)
The best solution is to use markup, such as <sup>.
You can't create the characters, but you can format them as superscripts if you are generating HTML.
To find which ones exist, use a Unicode character search resource and look for "superscript" to get a listing.
This query, for example:
https://www.fileformat.info/info/unicode/char/search.htm?q=superscript&preview=entity
As you can see, all digits are available (more than once, even), but very few letters.
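If you would rather run the search locally, the same idea can be sketched in Python by scanning character names with the standard `unicodedata` module. The function name `find_by_name` is made up here, and the exact results depend on the Unicode version bundled with your Python.

```python
import sys
import unicodedata


def find_by_name(keyword: str) -> list[str]:
    """Return every character whose Unicode name contains the keyword."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # The second argument is returned for unnamed code points.
        if keyword in unicodedata.name(ch, ""):
            hits.append(ch)
    return hits


superscripts = find_by_name("SUPERSCRIPT")
for ch in superscripts:
    print(f"U+{ord(ch):04X} {ch} {unicodedata.name(ch)}")
```

Note that this only matches names containing the word SUPERSCRIPT; the raised letters used for phonetic transcription are named differently, so they will not show up in this search.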
However, if you intend to generate HTML output, the <sup> tag will work for any text you want, and give the necessary semantic meaning to the text - you can read about it and try it online here: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/sup
The Unicode catalogue includes a number of white-space characters, some of which don't appear to work in any context in HTML documents - but some of which, rather usefully, do.
Here is an example:
<h1 title="Hi! As a title attribute,
I can contain horizontal tabs
and carriage returns
and line feeds.">HTML's handling of &#009; | &#010; | &#013;</h1>
<p>Hello. As a paragraph element, I can't contain horizontal tabs
or carriage returns
or line feeds.</p>
<input type="submit" value="I am a value attribute and
like title I can also handle line feeds" /><br />
<input type="submit" value="I am another value attribute. Like title I can handle horizontal tabs" /><br />
<input type="submit" value="I am a third value attribute.
Unlike title I can't handle carriage returns" />
Is there any official spec or series of guidelines which detail which white-space characters can be deployed in HTML documents and where?
It's a little unclear what you mean by work, but I'm going to assume you mean rendering, at which point what happens is really up to CSS.
https://www.w3.org/TR/CSS2/text.html#white-space-model defines how most whitespace characters are normalized away, unless you adjust the white-space property.
Note that the display of tooltips (such as from the title attribute) and form controls (such as from input elements) is not defined by any standard, leaving that effectively up to browsers.
Disclaimer: this answer was composed for the question as originally written, making explicit references to ASCII control characters. It was apparently a red herring so the information here may look confusing now.
First of all, I don't think anybody uses ASCII any more. In 2016 the only sensible encoding is UTF-8. In any case, UTF-8 is a superset of ASCII (and you can use ASCII anyway), so the question is still valid.
Secondly, your example isn't correct. All the HTML entities you mention are printable characters:
&#009; is 'CHARACTER TABULATION' (U+0009) (i.e. a tab)
&#013; is 'CARRIAGE RETURN (CR)' (U+000D) (i.e. a legacy Mac OS line ending)
&#010; is 'LINE FEED (LF)' (U+000A) (i.e. a Unix line ending)
(And please note that Windows line endings are a combination of CR+LF.)
If you're really talking about control characters:
EOT End of Transmission
ACK Acknowledgement
BEL Bell
...
... we first need to understand that HTML is meant to be plain text (as such, its MIME content type is text/html). The HTML5 Living Standard provides a definition of control character that's wider than the ASCII one, but in any case such characters don't seem to be allowed:
Any occurrences of any characters in the ranges U+0001 to U+0008,
U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters
U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE,
U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF,
U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE,
U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF,
U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse
errors. These are all control characters or permanently undefined
Unicode characters (noncharacters).
Any character that is a not a Unicode character, i.e. any isolated
surrogate, is a parse error. (These can only find their way into the
input stream via script APIs such as document.write().)
If you actually refer to the characters in your example, some of them are handled as exceptions in the parsing stage:
U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
characters are treated specially. Any LF character that immediately
follows a CR character must be ignored, and all CR characters must
then be converted to LF characters. Thus, newlines in HTML DOMs are
represented by LF characters, and there are never any CR characters in
the input to the tokenization stage.
... but I suspect you are only interested in white-space collapsing:
In HTML, only the following characters are defined as white space
characters:
ASCII space (&#x0020;)
ASCII tab (&#x0009;)
ASCII form feed (&#x000C;)
Zero-width space (&#x200B;)
[...]
In particular, user agents should collapse input white space sequences
when producing output inter-word space.
[...]
The PRE element is used for preformatted text, where white space is
significant.
In other words, consecutive white space characters become a simple space (except inside <pre> tag). (I could only find a link for HTML 4 but that's something that hasn't changed significantly).
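The two quoted behaviours (newline normalization, then white-space collapsing) can be approximated in a few lines of Python. This is only a simplified sketch with made-up function names, not what any browser actually implements: it ignores CSS, it does not trim leading or trailing spaces, and it deliberately leaves the zero-width space alone.

```python
import re

# Space, tab, form feed and line feed: the characters that get collapsed here.
_SPACE_RUN = re.compile(r"[ \t\f\n]+")


def normalize_newlines(text: str) -> str:
    """Turn CR LF pairs and lone CRs into LF, as the parser is described to do."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def collapse_whitespace(text: str) -> str:
    """Replace each run of white space with a single space."""
    return _SPACE_RUN.sub(" ", normalize_newlines(text))


sample = "Hello.\tAs a paragraph element,\r\n   I collapse   white space."
print(collapse_whitespace(sample))
```

Text inside a <pre> element, or any element styled with a non-default white-space property, would need to skip the collapsing step.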
Is there any official spec or series of guidelines? Sure there are: you have the official W3C recommendations and the WHATWG specs, but they're basically technical documentation aimed mostly at browser vendors: extensive, comprehensive and hard to decipher into plain English ;-)
Hello, I am trying to compile an EPUB v2.0 with HTML code extracted from InDesign. I have noticed there are a lot of "special characters" either at the beginning of a paragraph or at the end. For example:
<p class="text_indent0px font_size0_8em line_height1_325 margin_bottom1px margin_left0px margin_right0px sans_serif floatleft">E<span class="small_caps">VELYNE</span>&#9;</p>
What is this &#9;
and can I either get rid of it or replace it with a "&nbsp;"?
	
&#9; is the ASCII code for tabs. So I guess the paragraphs were indented with tabs.
If you want to replace them with &nbsp;, then use 4 of them.
That would be a horizontal tab (i.e. the same as using the tab key).
If you want to replace it, I would suggest doing a find/replace using an ePub editor like Sigil (http://sigil-ebook.com/).
&#9; represents the horizontal tab.
Similarly, &#32; represents a space.
To replace &#9; you have to use &nbsp;.
In the HTML encoding &#{number};, {number} is the ASCII code. Therefore, &#9; is a tab, which typically collapses down to one space in HTML unless you use CSS (or the <pre> tag) to treat it as preformatted text.
Therefore, it's not safe to replace it with a non-breaking or a regular space unless you can guarantee that it's not being displayed as a tab anywhere.
div:first-child {
white-space: pre;
}
<div>&#9;Test</div>
<div>&#9;Test</div>
<pre>&#9;Test</pre>
See https://developer.mozilla.org/en-US/docs/Web/CSS/white-space and http://ascii.cl/
&nbsp; is the entity used to represent a non-breaking space
&#32; is the decimal character code of the space we enter using the keyboard spacebar
&#9; is the decimal character code of the horizontal tab
&#32; and &nbsp; both represent a space, but &nbsp; is non-breaking, which means multiple sequential occurrences will not be collapsed into one, whereas in the same case &#32; will collapse to one space
= approx. 4 spaces and approx. 8 spaces
There are four types of character reference scheme used.
Using decimal character codes (regex-pattern: &#[0-9]+;),
Using hexadecimal character codes (regex-pattern: &#x[a-f0-9]+;),
Using named character codes (regex-pattern: &[a-z]+;),
Using the actual characters (regex-pattern: .).
All of these notations are rendered the same way; only the coding style differs. For example, if you need to display a Latin small letter e with diaeresis, you could use any of the conventions below:
&#235; (decimal notation),
&#xEB; (hexadecimal notation),
&euml; (html notation),
ë (actual character),
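A quick way to convince yourself that these notations all name the same character is to decode them. The sketch below uses Python's standard `html.unescape`; the sample strings are the standard decimal, hexadecimal and named references for U+00EB, plus the tab, space and non-breaking space references relevant to this question.

```python
import html

# Four spellings of LATIN SMALL LETTER E WITH DIAERESIS (U+00EB).
forms = ["&#235;", "&#xEB;", "&euml;", "\u00EB"]
decoded = [html.unescape(form) for form in forms]
print(decoded)

# The three references that tend to get mixed up in this question.
for ref in ("&#9;", "&nbsp;", "&#32;"):
    ch = html.unescape(ref)
    print(f"{ref} -> U+{ord(ch):04X}")
```

All four spellings of the letter decode to the same string, while the last three references decode to three different characters.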
Likewise, as you asked, which should be used: (a) &#9; (decimal notation), (b) &nbsp; (html notation), or (c) &#32; (decimal notation)?
So, from the above analogy, (a), (b) and (c) are three different notations for three different characters.
For your information, (a) is a horizontal tab, (b) is the non-breaking space, which is &#160; in decimal notation, and (c) is the decimal notation for the normal space character.
Now, technically, a space at the end of a paragraph is meaningless, so you could simply discard them all. If you still need such spacing, use it inside <pre> elements, not in <p> or <div>.
Hope this helps...
I was wondering whether all languages treat the same set of characters as whitespace characters, or whether there is any variation.
Can anyone provide a complete list of whitespace characters, separating out the ones that can be entered from the keyboard? If it differs between languages, the differences and the reasons would be appreciated. Any language is helpful, as long as you don't bring up Whitespace or its variants (if any). I certainly don't want a complete list for a language like Whitespace :)
Whether a particular character is categorized as a whitespace character or not should depend on the character set being used. That said, it is not impossible that a programming language can make its own definition of what constitutes whitespace.
Most modern languages use the Unicode Character set, which does have a definition for space separator characters. Any character in the Zs category is a space separator.
You can see the complete list here. In addition you can grep for ;Zs; in the official Unicode Character Database to see those characters. Note that the number of characters in this category may grow as new Unicode versions come into existence, so I will not say how many such characters exist, nor even attempt to list them.
In addition to the Zs Unicode category, Unicode also defines character properties. Among the properties defined by Unicode is a Whitespace property. As of Unicode 7.0, characters with this property include all of the characters with category Zs plus a few control characters (including U+0009, U+000A, U+000B, U+000C, U+000D, and U+0085). You can find all of the characters with the whitespace property at Unicode.org here.
Now many languages, even modern ones, have special symbols for regular expressions such as \s or [:space:] but beware, these only refer to certain characters from the ASCII set; generally these are restricted to
SPACE (codepoint 32, U+0020)
TAB (codepoint 9, U+0009)
LINE FEED (codepoint 10, U+000A)
LINE TABULATION (codepoint 11, U+000B)
FORM FEED (codepoint 12, U+000C)
CARRIAGE RETURN (codepoint 13, U+000D)
Now this list is interesting because it contains not only a space separator (Zs), but also characters from the "Other, Control" category (Cc). This is what a programming language generally means when it uses the term "whitespace."
So probably the best way to answer your question for a "complete list" of whitespace characters is to say "it depends on what you mean." If you mean "classic whitespace" it is probably the six characters listed above. If you want something more "modern" then it is the union of those six with all the characters from the Unicode category Zs. Then again, you might need to look within other blocks, too (e.g., U+1361 as mentioned in a comment to your question by Jerry Coffin). It also depends on what you intend to do with these space characters.
Now one last thing: Unicode doesn't have every character in the world yet; it keeps growing. It is possible that someday new space characters will be added. For now, category Zs + the classics are your best bet.
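The "category Zs + the classics" rule of thumb can be sketched in Python as follows. The function names are invented for this illustration. Python's `re.ASCII` flag is used because it restricts `\s` to exactly the six classic characters; without that flag Python's `\s` already matches more than the ASCII set, so behaviour really does vary by language and configuration.

```python
import re
import unicodedata


def is_classic_space(ch: str) -> bool:
    """True for space, tab, LF, VT, FF and CR (ASCII-only \\s)."""
    return re.fullmatch(r"\s", ch, re.ASCII) is not None


def is_space_separator(ch: str) -> bool:
    """True when the character is in Unicode general category Zs."""
    return unicodedata.category(ch) == "Zs"


def is_whitespace(ch: str) -> bool:
    """Union of the classic six and the Zs category."""
    return is_classic_space(ch) or is_space_separator(ch)


for ch in ("\u0020", "\u0009", "\u00A0", "\u3000", "\u1361", "a"):
    print(f"U+{ord(ch):04X} classic={is_classic_space(ch)} "
          f"Zs={is_space_separator(ch)} whitespace={is_whitespace(ch)}")
```

U+1361 comes out as non-whitespace under this definition, which is exactly the kind of case where you would have to decide what "whitespace" should mean for your purpose.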
There are currently 25 Unicode whitespace characters with the following hexadecimal 'code points':
9, A, B, C, D, 20, 85, A0,
1680, 2000, 2001, 2002, 2003, 2004, 2005, 2006,
2007, 2008, 2009, 200A, 2028, 2029, 202F, 205F,
3000
Corresponding decimal values are:
9, 10, 11, 12, 13, 32, 133, 160,
5760, 8192, 8193, 8194, 8195, 8196, 8197, 8198,
8199, 8200, 8201, 8202, 8232, 8233, 8239, 8287,
12288
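As a sanity check on the list above, the following Python sketch parses the hexadecimal code points and groups them by Unicode general category using the standard `unicodedata` module. The variable names are arbitrary, and the category counts reflect the Unicode data shipped with current Python versions.

```python
import unicodedata
from collections import Counter

HEX_CODE_POINTS = """
9 A B C D 20 85 A0
1680 2000 2001 2002 2003 2004 2005 2006
2007 2008 2009 200A 2028 2029 202F 205F
3000
""".split()

chars = [chr(int(h, 16)) for h in HEX_CODE_POINTS]
decimals = [ord(ch) for ch in chars]
by_category = Counter(unicodedata.category(ch) for ch in chars)

print(len(chars))
print(decimals)
print(dict(by_category))
```

The breakdown shows that the list is the Zs space separators plus six control characters and the line and paragraph separators (Zl and Zp).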
I originally acquired this information from Unicode.org, but my old link is no longer a working URL. Wikipedia has a nice page on the subject though, at https://en.wikipedia.org/wiki/Whitespace_character if anyone is interested, which also gives 25 characters. (I have not cross-checked that these are the same characters, but I trust that the Unicode Consortium has not made such a major, breaking change to their character set!)
I did find one simple page on unicode's website today, but it looks a bit more like a draft html page rather than anything supporting or claiming an official stance. But it does match what Unicode had previously posted as an official claim regarding what all of their whitespace characters are. (The link is in my comment below my answer.)
If you're looking for an efficient method, I use the following code:
(c <= 32 && c >= 0) || c == 127;
0 to 31 are the control characters, 32 is the SPACE character and 127 is the DEL character. This works for all the character sets I know, including UTF-8.
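For comparison, here is the same range check written as a Python sketch and applied to UTF-8 bytes. The function names are invented here. It treats every code from 0 to 32 and code 127 as "whitespace", which is broader than most definitions, and it cannot see non-ASCII spaces such as U+00A0 because all bytes of a multi-byte UTF-8 sequence are 128 or above.

```python
def is_ascii_space_or_control(code: int) -> bool:
    """Range check from the answer: codes 0-32 and 127."""
    return 0 <= code <= 32 or code == 127


def strip_ascii_space_and_control(data: bytes) -> bytes:
    """Drop every byte that the range check flags."""
    return bytes(b for b in data if not is_ascii_space_or_control(b))


raw = "a\tb \u00E9\x7f".encode("utf-8")
print(strip_ascii_space_and_control(raw).decode("utf-8"))
```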
Putting two dashes on a page, like this --, on rare occasions messes the HTML up.
For instance, if you enter -- into your WordPress blog it'll actually munch it into a single -. This doesn't work well for code that requires --options --to --be --specified --this --way.
The HTML entity for – is &ndash; and for the longer — it is &mdash;, but what is the HTML entity to enter a NORMAL DASH - in a page?
This should do it. It's not listed as a dash, you need to find a place that lists it as the minus sign.
Code block shows code:
&#45;
In use: - (&#45;)
Double: -- (&#45;&#45;)
EDIT: My source for this answer.
This tool makes it easy to find (instead of looking through a chart): http://amp-what.com/#q=dash
The HTML character names dash; and hyphen; (use with leading &) have the same Unicode, U+2010.
From Wikipedia (Dash, Hyphen, Plus and minus signs) and HTML spec:
character | Unicode (hexadecimal) | HTML entity
hyphen ‐ | U+2010 | &hyphen; &dash;
Figure dash ‒ | U+2012 | (none)
En dash – | U+2013 | &ndash;
Em dash — | U+2014 | &mdash;
Horizontal bar ― | U+2015 | &horbar;
minus sign − | U+2212 | &minus;
https://en.wikipedia.org/wiki/Dash#Unicode also lists the unicode codes for similar characters/marks like hyphen, minus and tilde.
Also see:
https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
https://en.wikipedia.org/wiki/Unicode_and_HTML
Besides the "ndash" and "mdash" that you've mentioned (and which I think are valid for most uses), there is the hyphen as represented by &hyphen; (rendered: ‐) and the math symbol for minus represented by &minus; (rendered: −).
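To see which characters the various dash references decode to, and to protect plain double hyphens from being merged, a small Python sketch using the standard `html.unescape` may help. The helper `escape_hyphens` is hypothetical, and it assumes the numeric reference &#45; (U+002D HYPHEN-MINUS) is what you want for the ordinary keyboard dash.

```python
import html

# Named references and the code points they are expected to decode to.
NAMED = {
    "&hyphen;": "\u2010",
    "&dash;": "\u2010",
    "&ndash;": "\u2013",
    "&mdash;": "\u2014",
    "&horbar;": "\u2015",
    "&minus;": "\u2212",
}


def escape_hyphens(text: str) -> str:
    """Spell every plain hyphen-minus as the numeric reference &#45;."""
    return text.replace("-", "&#45;")


for ref in NAMED:
    print(f"{ref} -> U+{ord(html.unescape(ref)):04X}")

escaped = escape_hyphens("--options --to --be")
print(escaped)
print(html.unescape(escaped))
```

Whether a given blogging platform leaves &#45;&#45; alone is something to test on that platform; the sketch only shows that the escaped form decodes back to the original text.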