I was wondering if all the language treats the same set of characters as white space charactes or is there any variation.
Can anyone provide complete list of White space characters separating the one which can be entered from keyboard? If it's different, the difference and the reason would be more appropriate. Any language is helpful if you don't bring out Whitespace or its variants(if any). I certainly don't want a complete list for language like Whitespace :)
Whether a particular character is categorized as a whitespace character or not should depend on the character set being used. That said, it is not impossible that a programming language can make its own definition of what constitutes whitespace.
Most modern languages use the Unicode Character set, which does have a definition for space separator characters. Any character in the Zs category is a space separator.
You can see the complete list here. In addition you can grep for ;Zs; in the official Unicode Character Database to see those characters. Note that the number of characters in this category may grow as new Unicode versions come into existence, so I will not say how many such characters exist, nor even attempt to list them.
In addition to the Zs Unicode category, Unicode also defines character properties. Among the properties defined by Unicode is a Whitespace property. As of Unicode 7.0, characters with this property include all of the characters with category Zs plus a few control characters (including U+0009, U+000A, U+000B, U+000C, U+000D, and U+0085). You can find all of the characters with the whitespace property at Unicode.org here.
Now many languages, even modern ones, have special symbols for regular expressions such as \s or [:space:] but beware, these only refer to certain characters from the ASCII set; generally these are restricted to
SPACE (codepoint 32, U+0020)
TAB (codepoint 9, U+0009)
LINE FEED (codepoint 10, U+000A)
LINE TABULATION (codepoint 11, U+000B)
FORM FEED (codepoint 12, U+000C)
CARRIAGE RETURN (codepoint 13, U+000D)
Now this list is interesting because it contains not only space separators (Zs), but also from the "Control, Other" category (Cc). This is what a programming language generally means when it uses the term "whitespace."
So probably the best way to answer your question for a "complete list" of whitespace characters is to say "it depends on what you mean." If you mean "classic whitespace" it is probably the six characters listed above. If you want something more "modern" then it is the union of those six with all the characters from the Unicode category Zs. Then again, you might need to look within other blocks, too (e.g., U+1361 as mentioned in a comment to your question by Jerry Coffin). It also depends on what you intend to do with these space characters.
Now one last thing: Unicode doesn't have every character in the world yet; it keeps growing. It is possible that someday new space characters will be added. For now, category Zs + the classics are your best bet.
There are currently 25 Unicode whitespace characters with the following hexadecimal 'code points':
9, A, B, C, D, 20, 85, A0,
1680, 2000, 2001, 2002, 2003, 2004, 2005, 2006,
2007, 2008, 2009, 200A, 2028, 2029, 202F, 205F,
3000
Corresponding decimal values are:
9, 10, 11, 12, 13, 32, 133, 160,
5760, 8192, 8193, 8194, 8195, 8196, 8197, 8198,
8199, 8200, 8201, 8202, 8232, 8233, 8239, 8287,
12288
I originally acquired this information from Unicode.org, but my old link is no longer a working URL. Wikipedia has a nice page on the subject tho, at https://en.wikipedia.org/wiki/Whitespace_character if any are interested, which also gives 25 characters. (I have not cross-referenced that these characters are the same characters, but i trust that the Unicode Consortium has not made such a breaking, major change to their character set!)
I did find one simple page on unicode's website today, but it looks a bit more like a draft html page rather than anything supporting or claiming an official stance. But it does match what Unicode had previously posted as an official claim regarding what all of their whitespace characters are. (The link is in my comment below my answer.)
If you're looking for an efficient method, I use the following code:
(c <= 32 && c >= 0) || c == 127;
0 to 31 are the control characters, 32 is the SPACE character and 127 is the ESC character. This works for all the character sets I know, including UTF-8.
Related
I´m searching for a list of exponents like ¹²³ and so on and the same with letters. Note these still remain superscripted even in plain text.
Does something like these exist? If not, how can I create those?
(I need them for a website-project)
Unicode versions of superscripted/subscripted characters exist for all ten digits but not for all letters. They remain superscripted/subscripted in a plain-text environment without the need of format tags such as <sup>/<sub>.
However (as of v14), not all letters have Unicode superscripts. Furthermore, they are scattered along different Unicode ranges, and are in fact used mainly for phonetic transcription. Additionally, they are used for compatibility purposes especially if the text does not support markup superscripts and subscripts.
Exponent characters:
These are mostly used for mathematical and referencing usage.
- ⁰ [U+2070]
- ¹ [U+00B9, Latin-1 Supplement]
- ² [U+00B2, Latin-1 Supplement]
- ³ [U+00B3, Latin-1 Supplement]
- ⁴ [U+2074]
- ⁵ [U+2075]
- ⁶ [U+2076]
- ⁷ [U+2077]
- ⁸ [U+2078]
- ⁹ [U+2079]
- ⁺ [U+207A]
- ⁻ [U+207B]
- ⁼ [U+207C]
- ⁽ [U+207D]
- ⁾ [U+207E]
- ⁿ [U+207F]
- ⁱ [U+2071]
The "linear", "squared", and "cubed" subscripts are the most familiar and are found in Latin-1 Supplement. All the others are found in Superscripts and Subscripts. Add 0x2070 to all the non-Latin-1 Supplement superscripts to obtain the code point value of these digits. See this Wikipedia article and the official Unicode codepage segment.
Interesting notes
There are also subtle differences between <sup> subscripts and Unicode subscripts; Unicode subscripts are entirely different codepoints altogether, and some fonts professionally design subscripted letters because <sup> subscripts may look thin.
Compare x² with x2, similarly x⁺ with x+ (the first involves Unicode, the second is markup)
The best solution is to use markup, such as <sup>.
You can't create the characters, but you can format then as super-scripts if you are generating HTML.
As to find which exist, you just have to use an unicode-character searching resource and look for "superscript" to have a listing -
This query, for example:
https://www.fileformat.info/info/unicode/char/search.htm?q=superscript&preview=entity
As you can see, all digits are available (more than once, even), but very few letters.
However, if you intend to generate HTML output, the <sup> tag will work for any text you want, and give the necessary semantic meaning to the text - you can read about it and try it online here: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/sup
Many posts and pages on social media often make use of uncommon stylised lettering to give a humorous effect or otherwise be eye-catching such as:
<div id="..." class="...">
<p>LIKE THE 🅱ACKUP JUST INCASE WE GET PERMANENTLY ZUCCED </p>
<p>
and this:
<a>Vaporwave</a>
What exactly are the "🅱" and "V" type characters and what are their purposes? Wouldnt using normal text characters make these redundant?
Edit:
adding an image of the code for those who may not be able to see the characters:
🅱 is Unicode codepoint U+1F171 SQUARED LATIN CAPITAL LETTER B. It is defined in the Enclosed Alphanumeric Supplement section of the Unicode standard.
From this Wikipedia explanation:
Enclosed alphanumerics is a Unicode block of typographical symbols of an alphanumeric within a circle, a bracket or other not-closed enclosure, or ending in a full stop. There is another block for these characters (U+1F100—U+1F1FF), encoded in the Supplementary Multilingual Plane, which contains the set of Regional Indicator Symbols as of Unicode 6.0.
Purpose
Many of these characters were originally intended for use as bullets for lists.[3] The parenthesized forms are historically based on typewriter approximations of the circled versions.[3] Although these roles have been supplanted by styles and other markup in "rich text" contexts, the characters are included in the Unicode standard "for interoperability with the legacy East Asian character sets and for the occasional text context where such symbols otherwise occur."[3] The Unicode Standard considers these characters to be distinct from characters which are similar in form but specialized in purpose, such as the circled C, P or R characters which are defined as copyright and trademark symbols or the circled a used for an at sign.[3]
V is Unicode codepoint U+FF36 FULLWIDTH LATIN CAPITAL LETTER V. It is defined in the Halfwidth and Fullwidth Forms section of the Unicode standard.
From this Wikipedia explanation:
In CJK (Chinese, Japanese and Korean) computing, graphic characters are traditionally classed into fullwidth (in Taiwan and Hong Kong: 全形; in CJK and Japanese: 全角) and halfwidth (in Taiwan and Hong Kong: 半形; in CJK and Japanese: 半角) characters. With fixed-width fonts, a halfwidth character occupies half the width of a fullwidth character, hence the name.
In the days of computer terminals and text mode computing, characters were normally laid out in a grid, often 80 columns by 24 or 25 lines. Each character was displayed as a small dot matrix, often about 8 pixels wide, and an SBCS (single byte character set) was generally used to encode characters of western languages.
For a number of practical and aesthetic reasons, Han characters would need to be twice as wide as these fixed-width SBCS characters. These "fullwidth characters" were typically encoded in a DBCS (double byte character set), although less common systems used other variable-width character sets that used more bytes per character.
Halfwidth and Fullwidth Forms is also the name of a Unicode block U+FF00–FFEF.
In Unicode
In Unicode, if a certain grapheme can be represented as either a fullwidth character or a halfwidth character, it is said to have both a fullwidth form and a halfwidth form.
Halfwidth and Fullwidth Forms is the name of Unicode block U+FF00–FFEF, the last of the Basic Multilingual Plane excepting the short Specials block at U+FFF0–FFFF.
Range U+FF01–FF5E reproduces the characters of ASCII 21 to 7E as fullwidth forms, that is, a fixed width form used in CJK computing. This is useful for typesetting Latin characters in a CJK environment. U+FF00 does not correspond to a fullwidth ASCII 20 (space character), since that role is already fulfilled by U+3000 "ideographic space."
Range U+FF65–FFDC encodes halfwidth forms of Katakana and Hangul characters – see half-width kana. Range U+FFE0–FFEE includes fullwidth and halfwidth symbols.
The Unicode catalogue includes a number of white-space characters, some of which don't appear to work in any context in HTML documents - but some of which, rather usefully, do.
Here is an example:
<h1 title="Hi! As a title attribute,
I can contain horizontal tabs
and carriage returns
and line feeds.">HTML's handling of &009; | &010; | &013;</h1>
<p>Hello. As a paragraph element, I can't contain horizontal tabs
or carriage returns
or line feeds.</p>
<input type="submit" value="I am a value attribute and
like title I can also handle line feeds" /><br />
<input type="submit" value="I am another value attribute. Like title I can handle horizontal tabs" /><br />
<input type="submit" value="I am a third value attribute.
Unlike title I can't handle carriage returns" />
Is there any official spec or series of guidelines which detail which white-space characters can be deployed in HTML documents and where?
It's a little unclear what you mean by work, but I'm going to assume you mean rendering, at which point what happens is really up to CSS.
https://www.w3.org/TR/CSS2/text.html#white-space-model defines how most whitespace characters are normalized away, unless you adjust the white-space property.
Note that the display of toolbars (such as from the title attribute) and form controls (such as from input elements) is not defined by any standard, leaving that effectively up to browsers.
Disclaimer: this answer was composed for the question as originally written, making explicit references to ASCII control characters. It was apparently a red herring so the information here may look confusing now.
First of all, I don't think nobody uses ASCII any more. In 2016 the only sensible encoding is UTF-8. Whatever, UTF-8 is a superset of ASCII (and you can use ASCII anyway) so the question is still be valid.
Secondly, your example isn't correct. All the HTML entities you mention are printable characters:
is 'CHARACTER TABULATION' (U+0009) (i.e. a tab)
is 'CARRIAGE RETURN (CR)' (U+000D) (i.e. a legacy MacOS line feed)
is 'LINE FEED (LF)' (U+000A) (i.e. a Unix line feed)
(And please note that Windows line feeds are a combination of CR+LF.)
If you're really talking about control characters:
EOT End of Transmission
ACK Acknowledgement
BEL Bell
...
... we first need to understand that HTML is meant to be plain text (as such, it's MIME content type is text/html). The HTML5 Living Standard provides a definition of control character that's wider than the ASCII one but in any case it doesn't seem to be allowed:
Any occurrences of any characters in the ranges U+0001 to U+0008,
U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters
U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE,
U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF,
U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE,
U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF,
U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse
errors. These are all control characters or permanently undefined
Unicode characters (noncharacters).
Any character that is a not a Unicode character, i.e. any isolated
surrogate, is a parse error. (These can only find their way into the
input stream via script APIs such as document.write().)
If you actually refer to the characters in your example, some of then are considered exceptions in the parsing stage:
U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
characters are treated specially. Any LF character that immediately
follows a CR character must be ignored, and all CR characters must
then be converted to LF characters. Thus, newlines in HTML DOMs are
represented by LF characters, and there are never any CR characters in
the input to the tokenization stage.
... but I suspect you are only interested in white-space collapsing:
In HTML, only the following characters are defined as white space
characters:
ASCII space ( )
ASCII tab ( )
ASCII form feed ()
Zero-width space ()
[...]
In particular, user agents should collapse input white space sequences
when producing output inter-word space.
[...]
The PRE element is used for preformatted text, where white space is
significant.
In other words, consecutive white space characters become a simple space (except inside <pre> tag). (I could only find a link for HTML 4 but that's something that hasn't changed significantly).
Is there any official spec or series of guidelines? Sure they are: you have the official W3C recommendations and the WHATWG specs but they're basically technical documentation mostly addressed at browser vendors: extensive, comprehensive and hard to decipher into plain English ;-)
This question already has answers here:
Regex to match only letters
(20 answers)
Closed 8 years ago.
I'm trying to create a regex for a HTML5 input so a user can only insert alpha characters that may be in a name. So characters from a-z, but also including ö,ü,â,æ ... and so on whilst also allowing whitespace and hyphens .
I have played around with some pattens but nothing seems to work correctly, this is what I have so far: <input type="text" name="firstname" pattern="[a-zA-Z\x7f-\xff] " title="">
Does anyone have a quick answer for this?
Since the HTML5 pattern attribute uses the same regex syntax as JavaScript, there is no simple way to refer to all alphabetic characters. You would need to write a rather huge expression (and to update it as new alphabetic characters are added to Unicode). You would need to start from the Unicode character database and the definition of General Category of characters there, or rely on someone having done that for you.
However, for your practical purposes, testing for “alpha characters that may be in a name” is even more complex. There are non-alphabetic characters used in names, such as left single quotation mark (‘) in addition to normal quotation mark (’), and who knows what characters there might be? If this is about people’s real names, it is very difficult to impose restrictions that do not discriminate. If this is about user names in a system, for example, you can define the repertoire as you like, but [a-zA-Z\x7f-\xff] does not look adequate (it includes some control characters and some non-alphabetic characters and excludes many Latin letters commonly used in Europe).
There is a very simple method to apply all you RegEx logic(that one can apply easily in English) for any Language using Unicode.
For matching a range of Unicode Characters like all Alphabets [A-Za-z] we can use
[\u0041-\u005A] where \u0041 is Hex-Code for A and \u005A is Hex Code for Z
'matchCAPS leTTer'.match(/[\u0041-\u005A]+/g)
//output ["CAPS", "TT"]
In the same way we can use other Unicode characters or their equivalent Hex-Code according to their Hexadecimal Order (eg: \u0100–\u017FF) provided by unicode.org
Try: [À-ž] as an example of Range. Modify your Range according to your requirement.
It will match all characters between À and ž.
Sample regEx would be
/[A-Za-zÀ-ž\-\s]+/
For more Ref: Latin Unicode Character
I have written one XSLT to transform xml to html. If input xml node contains only space then it inserts the space using following code.
<xsl:text> </xsl:text>
There is another numeric character which also does same thing as shown below.
<xsl:text> </xsl:text>
Is there any difference between these characters? Are there any examples where one of these will work and other will not?
Which one is recommended to add space?
Thanks,
Sambhaji
is a non-breaking space ( ).
is just the same, but in hexadecimal (in HTML entities, the x character shows that a hexadecimal number is coming). There is basically no difference, A0 and 160 are the same numbers in a different base.
You should decide whether you really need a non-breaking space, or a simple space would suffice.
It's the same. It's a numeric character reference.
A0 is the same number as 160. The first is in base 16 (hexadecimal) and the second is in base 10 (decimal, everyday base).