What are these chars? [duplicate] - html

This question already has answers here:
ฏ๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎๎ํํํํํํํํํํํํํํํํํํํํํํํํํํ Why does this character never end? [closed]
(2 answers)
Closed 9 years ago.
How is it possible to do this ʘͥͥͥͥͥͥͥͥͥͥͥͥͥͥ͒_ʘͥͥͥͥͥͥͥͥͥͥͥͥͥͥ͒ in an HTML input field?
Or this:
ه҉̿҉̿҉̿҉̿҉̿҉̿҉̿҉̿҉̿҉̿҉҉҉҉҉҉҉҉҉҉҉҉҉҉ ه҉̿҉̿҉̿҉̿҉̿҉̿҉̿҉̿҉̿҉̿҉҉҉҉҉҉҉҉҉҉҉҉҉҉
I just copied and pasted these from a Twitter profile. I guess that they are pasting Unicode characters as hex codes, but looking at http://www.htmlescape.net/unicode_charts.html I couldn't find any character that overflows vertically or to the left.
I'm asking because I want to know how this can be avoided. People may start using this and break the look and style of many sites that accept comments, just like I did. Sorry...

These are so-called combining diacritical marks. The text in the question, in particular, uses the U+0365 COMBINING LATIN SMALL LETTER I character. You can easily create something very similar yourself, right in the browser, using this code:
var iMark = String.fromCharCode(869); // 869 is 0x365 in decimal: U+0365 COMBINING LATIN SMALL LETTER I
var testString = 'f' + Array(11).join(iMark); // 'f' with 10 combining small letter i marks stacked above it
This behaviour (combining all these marks instead of using just a single one) is well described in the official FAQ:
Q: Unicode doesn't contain the character I need, which is a Latin
letter with a certain diacritical mark. Can you add it?
A: Unicode can already express almost anything you will ever need in
any field of study by using a combination of Latin, IPA, or other base
letters with the various combining diacritical marks. For example, if
you need a highly specialized character such as “Z with stroke,
cedilla, and umlaut”, you can get this combination by using three
existing character codes in combination:
U+01B5 LATIN CAPITAL LETTER Z WITH STROKE
U+0327 COMBINING CEDILLA
U+0308 COMBINING DIAERESIS
With appropriate rendering software, that sequence should produce a
glyph combination like this:
Even if the combination is not available in a particular font, it is
unambiguous and Unicode conformant systems should transmit and retain
the sequence without distortion, and it may be processed
programmatically.
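As a rough illustration of the FAQ's example, the three code points can be joined into one string in JavaScript. The variable names are illustrative only, and whether the result renders as a single stacked glyph depends on the font and rendering software.

```javascript
// Base letter followed by two combining marks, as listed in the FAQ.
const zWithStroke = '\u01B5';        // LATIN CAPITAL LETTER Z WITH STROKE
const combiningCedilla = '\u0327';   // COMBINING CEDILLA
const combiningDiaeresis = '\u0308'; // COMBINING DIAERESIS

const combined = zWithStroke + combiningCedilla + combiningDiaeresis;

// The string still holds three separate code points; only the rendering
// stacks them onto one visible character.
const codePoints = Array.from(combined, (ch) =>
  'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));

console.log(combined, codePoints.join(' '));
```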
How can you deal with this (potential) nastiness without affecting valid text? One possible approach, I suppose, is to normalize the strings (NFC) first, then strip away all the non-valid characters.
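One way the normalize-then-strip idea might look in JavaScript is sketched below. The helper name limitCombiningMarks and the default cap of two marks per run are assumptions for illustration, not part of the original answer; some scripts legitimately stack more marks, so the cap would need tuning for real content. It also assumes an engine with Unicode property escapes (\p{M} with the u flag).

```javascript
// Hypothetical helper: keeps at most `maxMarks` combining marks in a row.
function limitCombiningMarks(input, maxMarks = 2) {
  // NFC first, so sequences with a precomposed form (e + U+0301 -> é)
  // collapse into one code point and are not counted as marks.
  const normalized = input.normalize('NFC');
  // \p{M} matches any code point in the Unicode "Mark" categories.
  const excessMarks = new RegExp(`(\\p{M}{${maxMarks}})\\p{M}+`, 'gu');
  // Keep the first `maxMarks` marks of each run and drop the rest.
  return normalized.replace(excessMarks, '$1');
}

const stacked = 'f' + '\u0365'.repeat(10);
console.log(limitCombiningMarks(stacked).length); // 3: 'f' plus two marks
```

Capping the run length rather than removing every mark leaves ordinary accented text alone, since NFC has already folded the common letter-plus-accent pairs into single code points.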

Related

IBM Extended ASCII Characters in HTML

I'm trying to get special characters into HTML, and am not sure if this is even possible. If anyone remembers Kroz, or just about every DOS interface - there is a special set of shape characters. I'm wanting to use the single braces, double braces, shadows, and other shape characters, but I can't seem to track any of these down anywhere.
Also, will using these characters in an HTML environment present any localization concerns / will there be a required charset?
Thanks!
There is no “extended ASCII”; ASCII ends at code position 127 decimal, 7F hexadecimal. What is called “extended ASCII” is a set of mutually incompatible 8-bit encodings that contain the printable ASCII characters in the same positions as in ASCII. In your case, you seem to want Code Page 437. All of its characters exist in Unicode. You can find the correspondence at
http://en.wikipedia.org/wiki/Code_page_437
which I believe to be correct in this issue; but the authoritative reference is
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT
There are various ways to enter the characters. You can use, say, “▓” as such in HTML, if you have some way of entering it and you use UTF-8 on the page. Alternatively, you can use character references like &#x2593;.
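If typing the characters is inconvenient, a numeric character reference can be derived from the character itself. The following is a minimal sketch; the function name toCharacterReference is made up for this example.

```javascript
// Hypothetical helper: builds a hexadecimal character reference
// (such as &#x2593;) for the first code point of the given string.
function toCharacterReference(ch) {
  const codePoint = ch.codePointAt(0);
  return '&#x' + codePoint.toString(16).toUpperCase() + ';';
}

console.log(toCharacterReference('▓')); // &#x2593;
console.log(toCharacterReference('═')); // &#x2550;
```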
Yes, similar characters exist in Unicode and can be used on a UTF-8 encoded page. These are called box drawing characters.
See: http://www.fileformat.info/info/unicode/block/box_drawing/utf8test.htm

How to represent negative number in HTML

Simple question, what is the proper character to use for representing a negative number? Should I use a normal dash, a minus entity or is there a more appropriate entity to use?
To be more clear, in an HTML document if I want to display a negative temperature, should I use:
-5 °C
or
&minus;5 °C
or something else?
There are two separate questions here: 1) what do you use as the character that acts as the sign of a negative number (or, more generally, how you write a negative number with characters), and 2) how do you represent that character in HTML. Only the latter is HTML-specific and can thus be considered on-topic at SO.
However, question 1 is, too, somewhat programming-related. In most computer languages (such as JavaScript, HTML, and CSS), a negative number is almost always written using the common ASCII hyphen, officially HYPHEN-MINUS, “-”, U+002D. For this reason, people often use the ASCII hyphen in general texts, too, and even in mathematical texts. This typically violates the rules of human languages. In most languages, the MINUS SIGN “−” U+2212 is preferred. It is also typographically much better, especially in quality fonts, where e.g. “−42” has a noticeable sign and in “-42” the ASCII hyphen is less noticeable. The MINUS SIGN also has better line breaking properties: web browsers do not treat it as allowing a line break between it and the following digit, as they may well do with the ASCII hyphen.
Having chosen to use the minus sign, the simplest approach is to use the character “−” itself. For this, you need some method of inputting it. You also need to take care of character encoding issues, normally using UTF-8, but this is something that should be done anyway.
You can also use the named character reference &minus;. It stands for the minus sign, and it can be convenient when you need the character occasionally but lack a quick way of typing it.
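When the number comes from script rather than from hand-written markup, the substitution can be done at display time. This is only a sketch, and formatWithMinusSign is a hypothetical name.

```javascript
// Hypothetical helper: renders a number for display, replacing the leading
// ASCII HYPHEN-MINUS (U+002D) with the MINUS SIGN (U+2212).
function formatWithMinusSign(value) {
  return String(value).replace(/^-/, '\u2212');
}

console.log(formatWithMinusSign(-5) + ' °C'); // −5 °C
console.log(formatWithMinusSign(21) + ' °C'); // 21 °C
```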
Here you can find different operators representation in html.
The other answers give you the correct information on the HTML, in that there are character codes that mean "minus", and the downside of using the hyphen is that the browser could word-wrap the number and put the digits on the line after the hyphen.
However, you asked "Should I use a normal dash"? That really depends on the context. In particular, if you want the user to be able to copy your text and paste it into another program, and have that program interpret your negative numbers as negative numbers, you are going to have to use a hyphen.
For example, copy the following two lines and paste them into an Excel spreadsheet:
−45
-45
You will notice that the first number is treated as text, and the second is treated as a number, even though according to the HTML, the opposite should happen.
To use a hyphen with negative numbers and prevent wrapping, set the CSS white-space property to nowrap on the element that contains the number.
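The copy-and-paste caveat is not specific to Excel: JavaScript's own number parsing also rejects the MINUS SIGN. The sketch below shows that, plus a hypothetical parseSignedNumber helper that accepts either character.

```javascript
// Number() only understands the ASCII hyphen-minus as a sign.
console.log(Number('-45'));      // -45
console.log(Number('\u221245')); // NaN, because U+2212 is not a valid sign here

// Hypothetical helper: converts a leading MINUS SIGN back to a hyphen-minus
// before parsing, so typographically correct text can be read as a number.
function parseSignedNumber(text) {
  return Number(text.trim().replace(/^\u2212/, '-'));
}

console.log(parseSignedNumber('\u221245')); // -45
```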

Unicode characters or encoded entities [duplicate]

This question already has answers here:
HTML and character encoding vs HTML Entity
(3 answers)
Closed 8 years ago.
I'm using some special characters like × (&times;) or … (&hellip;) in my HTML pages.
In some places I'm using the Unicode character directly, but in others I'm using an encoded entity like &hellip;.
I want to tidy up my code and can't decide what notation is better.
I could find just two pros and cons:
Using the character directly, I can set text in JavaScript with the text method, like $("#button").text("Loading…"), instead of html, which could lead to XSS issues.
Using characters directly can lead to encoding issues if the server is misconfigured.
Maybe I'm missing something important? What is the best practice?
If you use Unicode directly there's a chance that someone saves the file with the incorrect encoding and all characters get mangled.
On the other hand, using entity codes means you can't search effectively in the source code; you have to remember how each non-ASCII character is encoded. The same goes for database content. Readability suffers, too.
I prefer to use Unicode directly and everyone on the team knows that files must be encoded in UTF-8.
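To make the "mangled characters" risk concrete, here is a small sketch that simulates reading UTF-8 bytes as if each byte were one Latin-1 character. The function name misreadUtf8AsLatin1 is made up, and the sketch assumes a runtime that provides the standard TextEncoder (modern browsers and Node.js do).

```javascript
// Hypothetical helper: encodes text as UTF-8, then wrongly interprets every
// byte as a separate character, which is what a mis-declared encoding does.
function misreadUtf8AsLatin1(text) {
  const utf8Bytes = new TextEncoder().encode(text);
  return String.fromCharCode(...utf8Bytes);
}

// The single ellipsis character becomes three unrelated characters.
console.log(misreadUtf8AsLatin1('Loading…').length); // 10 instead of 8
```

ASCII characters survive this mistake unchanged, which is why entity references written in pure ASCII are immune to it.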

HTML pattern Arabic letters

I want to insert the Arabic letters in the pattern just like the English letters
pattern="[a-zA-Z0-9-_. ]{1,30}"
I have no idea how to accomplish the action.
The range for Arabic and Persian are shared so this code could be used for Arabic too.
[أ-يa-zA-Z]
This is the reference for finding the character range of Unicode languages:
preg_replace and preg_match arabic characters
http://unicode.org/charts/
The HTML5 pattern attribute follows JavaScript regular expression syntax, which makes things rather awkward. In engines that lack Unicode property escapes (introduced in ECMAScript 2018), you cannot test character properties, for example. Instead, you need to list the allowed characters or ranges of characters.
Using the normative Scripts.txt file (by the Unicode Consortium), which defines the script (writing system) of all characters, I constructed the following:
pattern=
"[a-zA-Z0-9-_. \
\u0620-\u063F\u0641-\u064A\u066E-\u066F\u0671-\u06D3\u06D5\
\u06E5-\u06E6\u06EE-\u06EF\u06FA-\u06FC\u06FF\u0750-\u077F\
\u08A0\u08A2-\u08AC\uFB50-\uFBB1\uFBD3-\uFD3D\uFD50-\uFD8F\
\uFD92-\uFDC7\uFDF0-\uFDFB\uFE70-\uFE74\uFE76-\uFEFC]{1,30}"
Starting from the set of all characters with script defined to be Arabic, I picked up those that are declared as letters (General Category Lo or Lm), and then omitted those beyond BMP, the Basic Multilingual Plane.
Characters outside BMP are used very rarely, and to represent them in JavaScript syntax, you would need to either include the characters themselves or use two \u notations per character (one for each component of a surrogate pair). This does not sound realistic.
This is of course a “hardwired” solution: it may need updates if new Arabic letters are added to Unicode or the script of a character is changed from or to Arabic (which is highly unlikely). But I don’t expect to see new Arabic letters added to BMP during my lifetime.
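For comparison, where the JavaScript engine supports Unicode property escapes with the u flag (an assumption about the runtime, not something the answer above relies on), the same intent can be expressed without a hardwired list. The sketch below validates in script rather than through the pattern attribute; namePattern and isValidName are hypothetical names. Unlike the list above, it is not limited to the BMP and follows whatever Unicode version the engine ships.

```javascript
// An Arabic-script letter (script Arabic AND general category Letter),
// or one of the ASCII characters from the original pattern; 1 to 30 of them.
const namePattern =
  /^(?:(?=\p{Script=Arabic})\p{L}|[a-zA-Z0-9_. -]){1,30}$/u;

// Hypothetical helper wrapping the test.
function isValidName(value) {
  return namePattern.test(value);
}

console.log(isValidName('محمد'));     // true
console.log(isValidName('Ali 42'));   // true
console.log(isValidName('<script>')); // false
```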

Are unicode characters better or more semantic than the simple text versions?

When I copy/paste text from most sites and PDFs, the following characters are almost always in the Unicode equivalent:
double quote: " is “ and ” (&ldquo; and &rdquo;)
single quote: ' is ‘ and ’ (&lsquo; and &rsquo;)
ellipsis: ... is … (&hellip;)
I understand ones that can't be represented without unicode like © and ¢, but even for those, I wonder.
When should you use these unicode equivalents? Are they more semantic than not using them? Are they better interpreted by devices (copy/paste/print)? I always find it annoying getting those quote and ellipsis characters because with textmate + programming, you don't use them.
When should you use these unicode equivalents? Are they more semantic than not using them?
Note that these are not “unicode equivalents”. Those characters are available in many character sets other than Unicode, and they are strictly distinct from the alternatives that you propose.
In typography, the left and right versions of the single and double quotation marks are correct. They provide the traditional appearance for those characters that has been used in print media for many years. The ellipsis character provides the correct spacing for an ellipsis that does not naturally occur when using consecutive full stop characters. So the reason all of these are used is to make the text appear correctly to human readers.
Are they better interpreted by devices (copy/paste/print)?
Any system that uses any character set should be designed to correctly handle that character set. If the text is encoded in Unicode, then any recent system (from the last 15 years at least) should be able to handle it, since Unicode is the de facto standard character set for all modern systems.
Not all Unicode-conformant systems will be able to display all characters correctly. This will depend on the fonts available, and even the rendering system that uses the fonts. But any Unicode-conformant system will be able to transmit the characters unaltered (such as in a copy and paste operation).
I always find it annoying getting those quote and ellipsis characters because with textmate + programming, you don't use them.
It is unusual to copy English (or whatever language) text directly into a program without having to add separate delimiters to that text. But most modern programming languages will not have any difficulty handling the text once it is properly delimited.
Any systems that cannot handle Unicode correctly should be updated. Legacy character encodings will have no place in the future.
I think there's a simple explanation: MS Word converts these characters/sequences automatically as you type, and a lot of text on the internet has been copied from this word processor.
Most of the articles I get for my site from other authors are sent as .doc files and I have to convert them. Usually, they contain the characters you've mentioned.
I'd also add one more: many different types of dashes instead of the hyphen. And also the low opening double quote (as seen in some European languages).
I usually let them stay in the text (all my pages are unicode). It's just important to remember it when playing around with regex etc (especially the dashes can be tricky and hard to spot).
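For cases where the typographic characters really do get in the way (pasting into source code, or matching with a regex that expects ASCII), a conversion pass might look like the sketch below. The function name toAsciiPunctuation and the exact set of characters covered are assumptions for illustration.

```javascript
// Hypothetical helper: maps common typographic punctuation to ASCII.
function toAsciiPunctuation(text) {
  return text
    .replace(/[\u201C\u201D\u201E]/g, '"')    // curly and low double quotes
    .replace(/[\u2018\u2019]/g, "'")          // curly single quotes
    .replace(/\u2026/g, '...')                // horizontal ellipsis
    .replace(/[\u2010-\u2015\u2212]/g, '-');  // hyphens, dashes, minus sign
}

console.log(toAsciiPunctuation('“Hi”… it’s 1–2')); // "Hi"... it's 1-2
```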
HTML entities serve a triple purpose:
Being able to use characters that do not belong to the document character set, e.g., insert a euro symbol in an ISO-8859-1 document.
Escape characters that have a special meaning in HTML, such as angle brackets.
Make it easier to type characters that are not in your keyboard or are not supported by your editor, e.g. a copyright symbol.
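The second purpose in the list, escaping characters that have a special meaning in HTML, is the one most often automated. A minimal sketch follows; escapeHtml is a hypothetical helper, not a built-in, and real projects would normally rely on their templating library for this.

```javascript
// Hypothetical helper: replaces the characters that are significant in HTML
// markup and attribute values with character references.
function escapeHtml(text) {
  const replacements = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#39;',
  };
  return text.replace(/[&<>"']/g, (ch) => replacements[ch]);
}

console.log(escapeHtml('<b>"a" & b</b>'));
// &lt;b&gt;&quot;a&quot; &amp; b&lt;/b&gt;
```

Characters such as © or … pass through untouched, which is fine as long as the page is served in an encoding that contains them.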
Update:
My info is correct but I suspect I've answered the wrong question...
On the web, I would consider that markup adds semantic meaning, content does not. So it doesn't really matter which you use in this context.
Typographers would insist on “ and ”, where programmers don't care and just use regular old quotes ".
The key here is interoperability. There are different encoding schemes. As many of us have experienced, people paste content into an editor from Word, which typically uses the windows-1252 encoding on Western systems. When you serve this content up via AJAX, it usually breaks, because AJAX uses UTF-8 encoding by default.
Office 2010 now allows for the saving of documents in UTF-8 format. Also, databases have different unicode encoding schemes. The best bet is to use UTF-8 end-to-end.
When you copy-pasta text that includes special characters, they will be left as they are. This is perfectly fine if the characters match the charset used by the webpage.
HTML entities are just a convenience for producing specific characters in any character set. Keyboards tend not to have keys to get symbols like ©, so the HTML entity is a shortcut.
I'm going to generalize and say that most of the time the content is UTF-8 (please correct me if I'm wrong). The copied characters are usually copied correctly and everything works great. If they aren't copied correctly, or the charset is subject to change, or you're after i18n support, go with the HTML or XML entities. Otherwise, leave them as they are; the browser will display them just fine.