Putting two dashes on a page sometimes, like this -- in rare occassions messes the HTML up.
For instance, if you enter -- into your Wordpress blog it'll actually munch it into a single -. This doesn't work well for code that requires --options --to --be --specified --this --way.
The HTML entity for – is &ndash and the longer — is — but what is the HTML entity to enter NORMAL DASH - in a page?
This should do it. It's not listed as a dash, you need to find a place that lists it as the minus sign.
Code block shows code:
-
In use: - (-)
Double: -- (--)
EDIT: My source for this answer.
This tool makes it easy to find (instead of looking through a chart): http://amp-what.com/#q=dash
The HTML character names dash; and hyphen; (use with leading &) have the same Unicode, U+2010.
From Wikipedia (Dash, Hyphen, Plus and minus signs) and HTML spec:
character
Unicode (hexadecimal)
HTML entity
hyphen
‐
U+2010 ‐ ‐
‐ ‐‐ ‐
Figure dash
‒
U+2012 ‒ ‒
En dash
–
U+2013 – –
– –
Em dash
—
U+2014 — —
— —
Horizontal bar
―
U+2015 ― ―
― ―
minus sign
−
U+2212 − −
− −
https://en.wikipedia.org/wiki/Dash#Unicode also lists the unicode codes for similar characters/marks like hyphen, minus and tilde.
Also see:
https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
https://en.wikipedia.org/wiki/Unicode_and_HTML
Besides the "ndash" and "mdash" that you've mentioned (and which I think are valid for most uses), there is the hyphen as represented by ‐ (rendered: ‐) and the math symbol for minus represented by − (rendered: −).
Related
I´m searching for a list of exponents like ¹²³ and so on and the same with letters. Note these still remain superscripted even in plain text.
Does something like these exist? If not, how can I create those?
(I need them for a website-project)
Unicode versions of superscripted/subscripted characters exist for all ten digits but not for all letters. They remain superscripted/subscripted in a plain-text environment without the need of format tags such as <sup>/<sub>.
However (as of v14), not all letters have Unicode superscripts. Furthermore, they are scattered along different Unicode ranges, and are in fact used mainly for phonetic transcription. Additionally, they are used for compatibility purposes especially if the text does not support markup superscripts and subscripts.
Exponent characters:
These are mostly used for mathematical and referencing usage.
- ⁰ [U+2070]
- ¹ [U+00B9, Latin-1 Supplement]
- ² [U+00B2, Latin-1 Supplement]
- ³ [U+00B3, Latin-1 Supplement]
- ⁴ [U+2074]
- ⁵ [U+2075]
- ⁶ [U+2076]
- ⁷ [U+2077]
- ⁸ [U+2078]
- ⁹ [U+2079]
- ⁺ [U+207A]
- ⁻ [U+207B]
- ⁼ [U+207C]
- ⁽ [U+207D]
- ⁾ [U+207E]
- ⁿ [U+207F]
- ⁱ [U+2071]
The "linear", "squared", and "cubed" subscripts are the most familiar and are found in Latin-1 Supplement. All the others are found in Superscripts and Subscripts. Add 0x2070 to all the non-Latin-1 Supplement superscripts to obtain the code point value of these digits. See this Wikipedia article and the official Unicode codepage segment.
Interesting notes
There are also subtle differences between <sup> subscripts and Unicode subscripts; Unicode subscripts are entirely different codepoints altogether, and some fonts professionally design subscripted letters because <sup> subscripts may look thin.
Compare x² with x2, similarly x⁺ with x+ (the first involves Unicode, the second is markup)
The best solution is to use markup, such as <sup>.
You can't create the characters, but you can format then as super-scripts if you are generating HTML.
As to find which exist, you just have to use an unicode-character searching resource and look for "superscript" to have a listing -
This query, for example:
https://www.fileformat.info/info/unicode/char/search.htm?q=superscript&preview=entity
As you can see, all digits are available (more than once, even), but very few letters.
However, if you intend to generate HTML output, the <sup> tag will work for any text you want, and give the necessary semantic meaning to the text - you can read about it and try it online here: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/sup
Many posts and pages on social media often make use of uncommon stylised lettering to give a humorous effect or otherwise be eye-catching such as:
<div id="..." class="...">
<p>LIKE THE 🅱ACKUP JUST INCASE WE GET PERMANENTLY ZUCCED </p>
<p>
and this:
<a>Vaporwave</a>
What exactly are the "🅱" and "V" type characters and what are their purposes? Wouldnt using normal text characters make these redundant?
Edit:
adding an image of the code for those who may not be able to see the characters:
🅱 is Unicode codepoint U+1F171 SQUARED LATIN CAPITAL LETTER B. It is defined in the Enclosed Alphanumeric Supplement section of the Unicode standard.
From this Wikipedia explanation:
Enclosed alphanumerics is a Unicode block of typographical symbols of an alphanumeric within a circle, a bracket or other not-closed enclosure, or ending in a full stop. There is another block for these characters (U+1F100—U+1F1FF), encoded in the Supplementary Multilingual Plane, which contains the set of Regional Indicator Symbols as of Unicode 6.0.
Purpose
Many of these characters were originally intended for use as bullets for lists.[3] The parenthesized forms are historically based on typewriter approximations of the circled versions.[3] Although these roles have been supplanted by styles and other markup in "rich text" contexts, the characters are included in the Unicode standard "for interoperability with the legacy East Asian character sets and for the occasional text context where such symbols otherwise occur."[3] The Unicode Standard considers these characters to be distinct from characters which are similar in form but specialized in purpose, such as the circled C, P or R characters which are defined as copyright and trademark symbols or the circled a used for an at sign.[3]
V is Unicode codepoint U+FF36 FULLWIDTH LATIN CAPITAL LETTER V. It is defined in the Halfwidth and Fullwidth Forms section of the Unicode standard.
From this Wikipedia explanation:
In CJK (Chinese, Japanese and Korean) computing, graphic characters are traditionally classed into fullwidth (in Taiwan and Hong Kong: 全形; in CJK and Japanese: 全角) and halfwidth (in Taiwan and Hong Kong: 半形; in CJK and Japanese: 半角) characters. With fixed-width fonts, a halfwidth character occupies half the width of a fullwidth character, hence the name.
In the days of computer terminals and text mode computing, characters were normally laid out in a grid, often 80 columns by 24 or 25 lines. Each character was displayed as a small dot matrix, often about 8 pixels wide, and an SBCS (single byte character set) was generally used to encode characters of western languages.
For a number of practical and aesthetic reasons, Han characters would need to be twice as wide as these fixed-width SBCS characters. These "fullwidth characters" were typically encoded in a DBCS (double byte character set), although less common systems used other variable-width character sets that used more bytes per character.
Halfwidth and Fullwidth Forms is also the name of a Unicode block U+FF00–FFEF.
In Unicode
In Unicode, if a certain grapheme can be represented as either a fullwidth character or a halfwidth character, it is said to have both a fullwidth form and a halfwidth form.
Halfwidth and Fullwidth Forms is the name of Unicode block U+FF00–FFEF, the last of the Basic Multilingual Plane excepting the short Specials block at U+FFF0–FFFF.
Range U+FF01–FF5E reproduces the characters of ASCII 21 to 7E as fullwidth forms, that is, a fixed width form used in CJK computing. This is useful for typesetting Latin characters in a CJK environment. U+FF00 does not correspond to a fullwidth ASCII 20 (space character), since that role is already fulfilled by U+3000 "ideographic space."
Range U+FF65–FFDC encodes halfwidth forms of Katakana and Hangul characters – see half-width kana. Range U+FFE0–FFEE includes fullwidth and halfwidth symbols.
The Unicode catalogue includes a number of white-space characters, some of which don't appear to work in any context in HTML documents - but some of which, rather usefully, do.
Here is an example:
<h1 title="Hi! As a title attribute,
I can contain horizontal tabs
and carriage returns
and line feeds.">HTML's handling of &009; | &010; | &013;</h1>
<p>Hello. As a paragraph element, I can't contain horizontal tabs
or carriage returns
or line feeds.</p>
<input type="submit" value="I am a value attribute and
like title I can also handle line feeds" /><br />
<input type="submit" value="I am another value attribute. Like title I can handle horizontal tabs" /><br />
<input type="submit" value="I am a third value attribute.
Unlike title I can't handle carriage returns" />
Is there any official spec or series of guidelines which detail which white-space characters can be deployed in HTML documents and where?
It's a little unclear what you mean by work, but I'm going to assume you mean rendering, at which point what happens is really up to CSS.
https://www.w3.org/TR/CSS2/text.html#white-space-model defines how most whitespace characters are normalized away, unless you adjust the white-space property.
Note that the display of toolbars (such as from the title attribute) and form controls (such as from input elements) is not defined by any standard, leaving that effectively up to browsers.
Disclaimer: this answer was composed for the question as originally written, making explicit references to ASCII control characters. It was apparently a red herring so the information here may look confusing now.
First of all, I don't think nobody uses ASCII any more. In 2016 the only sensible encoding is UTF-8. Whatever, UTF-8 is a superset of ASCII (and you can use ASCII anyway) so the question is still be valid.
Secondly, your example isn't correct. All the HTML entities you mention are printable characters:
is 'CHARACTER TABULATION' (U+0009) (i.e. a tab)
is 'CARRIAGE RETURN (CR)' (U+000D) (i.e. a legacy MacOS line feed)
is 'LINE FEED (LF)' (U+000A) (i.e. a Unix line feed)
(And please note that Windows line feeds are a combination of CR+LF.)
If you're really talking about control characters:
EOT End of Transmission
ACK Acknowledgement
BEL Bell
...
... we first need to understand that HTML is meant to be plain text (as such, it's MIME content type is text/html). The HTML5 Living Standard provides a definition of control character that's wider than the ASCII one but in any case it doesn't seem to be allowed:
Any occurrences of any characters in the ranges U+0001 to U+0008,
U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters
U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE,
U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF,
U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE,
U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF,
U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse
errors. These are all control characters or permanently undefined
Unicode characters (noncharacters).
Any character that is a not a Unicode character, i.e. any isolated
surrogate, is a parse error. (These can only find their way into the
input stream via script APIs such as document.write().)
If you actually refer to the characters in your example, some of then are considered exceptions in the parsing stage:
U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
characters are treated specially. Any LF character that immediately
follows a CR character must be ignored, and all CR characters must
then be converted to LF characters. Thus, newlines in HTML DOMs are
represented by LF characters, and there are never any CR characters in
the input to the tokenization stage.
... but I suspect you are only interested in white-space collapsing:
In HTML, only the following characters are defined as white space
characters:
ASCII space ( )
ASCII tab ( )
ASCII form feed ()
Zero-width space ()
[...]
In particular, user agents should collapse input white space sequences
when producing output inter-word space.
[...]
The PRE element is used for preformatted text, where white space is
significant.
In other words, consecutive white space characters become a simple space (except inside <pre> tag). (I could only find a link for HTML 4 but that's something that hasn't changed significantly).
Is there any official spec or series of guidelines? Sure they are: you have the official W3C recommendations and the WHATWG specs but they're basically technical documentation mostly addressed at browser vendors: extensive, comprehensive and hard to decipher into plain English ;-)
Hello I am trying to compile an EPUB v2.0 with html code extracted from Indesign. I have noticed there are a lot of "special characters" either at the beginning of a paragraph or at the end. For example
<p class="text_indent0px font_size0_8em line_height1_325 margin_bottom1px margin_left0px margin_right0px sans_serif floatleft">E<span class="small_caps">VELYNE</span> </p>
What is this
and can I either get rid of it or replace it with a "nbsp;"?
	
Is the ascii code for tabs. So I guess the paragraphs were indented with tabs.
If you want to replace them with then use 4 of them
That would be a horizontal tab (i.e. the same as using the tab key).
If you want to replace it, I would suggest doing a find/replace using an ePub editor like Sigil (http://sigil-ebook.com/).
represents the horizontal tab
Similarly represent space.
To replace you have to use
In the HTML encoding &#{number}, {number} is the ascii code. Therefore, is a tab which typically condenses down to one space in HTML, unless you use CSS (or the <pre> tag) to treat it as pre formatted text.
Therefore, it's not safe to replace it with a non-breaking or a regular space unless you can guarantee that it's not being displayed as a tab anywhere.
div:first-child {
white-space: pre;
}
<div> Test</div>
<div> Test</div>
<pre> Test</pre>
See https://developer.mozilla.org/en-US/docs/Web/CSS/white-space and http://ascii.cl/
is the entity used to represent a non-breaking space
decimal char code of space what we enter using keyboard spacebar
decimal char code of horizontal tab
and both represent space but is non-breaking means multiple sequential occurrence will not be collapsed into one where as for the same case, ` will collapse to one space
= approx. 4 spaces and approx. 8 spaces
There are four types of character reference scheme used.
Using decimal character codes (regex-pattern: &#[0-9]+;),
Using hexadecimal character codes (regex-pattern: &#x[a-f0-9]+;),
Using named character codes (regex-pattern: &[a-z]+;),
Using the actual characters (regex-pattern: .).
Al these conversions are rendered same way. But, the coding style is different. For example, if you need to display a latin small letter E with diaeresis then you could use any of the below convention:
ë (decimal notation),
ë (hexadecimal notation),
ë (html notation),
ë (actual character),
Likewise, as you said, what should be used (a) (decimal notation) or (b) (html notation) or (c) (decimal notation).
So, from the above analogy, it can be said that the (a), (b) and (c) are three different kind of notation of three different characters.
And, this is for your information that, (a) is a Horizontal Tab, the (b) one is the non-breaking space which is actually in decimal notation and the (c) is the decimal notation for normal space character.
Now, technically space at the end of the paragraph, is nothing but meaningless. Better, you could discard those all. And if you still need to use space inside <pre> elements, not in <p> or <div>.
Hope this helps...
In my html page I have displayed fractions using html special character. My idea is to display 1/2, 2/2 and 3/3.
I have used ⅓ for 1/3 and ⅔ for 2/3 and the special charactera are displayed correctly. I took reference from this link HTML Special Characters
But when I tried using &frac33; for 3/3 it is not working. It is just displaying as it is, not converting to special character.
Could you someone please tell me what is the html special character for 3/3.
Thank You
<sup>3</sup>⁄<sub>3</sub>
Result: 3⁄3
Not all fractions have their own special character. For those fractions (like 3/3) which don't have slanted fraction characters, use the HTML entity ⁄:
<sup>3</sup>⁄<sub>3</sub> = 3⁄3
There is no named (or numeric) character reference for a character representing 3/3, since there simply is no such character.
In theory, the FRACTION SLASH U+2044 “⁄” character (representable as ⁄ in HTML, among other thing) can be used between digits to suggest that rendering routines present the combination as a typographic fraction. In practice, only some typesetting programs can do this, and web browsers come nowhere near.
Trying to play with HTML markup and/or CSS to construct something that looks like a typographic fraction (comparable to ½ in appearance) tend to produce messy results, including uneven line spacing.
The practical option is to use just common notations like 2/2. But if you want something like a typographic fraction, you could use MathML with MathJax. More exactly, you would use the mfrac element in MathML with the attribute bevelled="true". Sample code:
<!doctype html>
<title>Fractions with MathJax and MathML</title>
<script src=
"http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
Here we have the common fraction ½, then
a simulation with HTML and CSS:
<sup>1</sup>⁄<sub>2</sub>.
Note that this tends to create uneven line spacing.
There are some cures to that, but let us see how MathML works:
<math>
<mfrac bevelled="true">
<mn>1</mn>
<mn>2</mn>
</mfrac>
</math>.
Some text here to demonstrate that line spacing has not
been disturbed here.
Sample rendering: