Prevent HTML Entity Conversion to Emoji

I've coded an HTML email and use "&#9654;" to code a right-pointing triangle in place of an image in a call-to-action. This renders as anticipated except on iOS devices, where this HTML entity is converted to its emoji counterpart. I also tried using the hex version instead of the decimal one with no success.
I've found posts where the solution utilizes php, but as this is an HTML email I can't use PHP.
Any way to prevent iOS from converting the HTML Entity into its emoji counterpart?
Here's the html entity I'm using: http://www.fileformat.info/info/unicode/char/25B6/index.htm

&#9654;&#xFE0E;
U+FE0F and U+FE0E are ‘variation selectors’, signalling that, respectively, an emoji-like (coloured/animated) or text-like rendering is preferred, if available. If neither is used, the renderer can choose at will. Unfortunately iOS in certain scenarios defaults to the emoji variant and has to be manually put right.
(Hex vs decimal character reference is immaterial. You can include the raw characters too; you don't necessarily have to encode them as character or entity references, but as raw characters the existence of the variation selector would be hard to see in an editor.)
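As a minimal sketch of the idea, assuming the email HTML is produced by a build step that can run JavaScript (the helper names below are hypothetical, not part of any library), the text variation selector can be appended and the result written out as character references so the otherwise invisible selector stays visible in the source:

```javascript
// U+FE0E requests text presentation; U+FE0F requests emoji presentation.
const TEXT_SELECTOR = "\uFE0E";

// Hypothetical helper: append the text variation selector to a symbol.
function withTextPresentation(symbol) {
  return symbol + TEXT_SELECTOR;
}

// Hypothetical helper: write every code point as a hex character reference.
function toHexReferences(text) {
  return Array.from(text, (ch) =>
    "&#x" + ch.codePointAt(0).toString(16).toUpperCase() + ";"
  ).join("");
}

const arrow = withTextPresentation("\u25B6");
console.log(toHexReferences(arrow)); // &#x25B6;&#xFE0E;
```

Whether a particular mail client honours the selector is worth checking on real devices.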

Related

Render MS Symbol font characters in html5

I want to take characters in the Microsoft Symbol font (taken from the w:sym tag in a docx file) and render them in html. When I look at how Word writes out the characters when I save the doc as html, I see this:
<span style='mso-char-type:symbol;mso-symbol-font-family:Symbol'>Â</span>
This appears as a script R in both Word and Word's html output.
When I write the same thing in my own html file, I see the A-hat in the regular font, and Chrome's element inspector warns that the mso- properties are unknown.
In Word's html output there is lots of mso-specific stuff but nothing I can see that lets Chrome know how to interpret mso-char-type and mso-symbol-font. I see the same behavior in IE.
Is there an easy way to tell the browser to use the Symbol font? Or do I have to explicitly translate the Symbol font characters to Unicode (using a static translation table?)
Thanks,
Wayne

The Symbol font is a privately-encoded font, i.e. it places various glyphs in positions that should be occupied by other characters according to character code standards. This means that a web page using it will fail badly whenever the Symbol font is not available, or the page style sheet is overridden, or the browser behaves correctly: e.g., the letter “Â” cannot be rendered using the Symbol font, so the browser will use a fallback font.
The proper way is to use Unicode encoded characters, such as “ℜ”, in a UTF-8 encoded page, with font-family on the applicable element set to contain a list of fonts that contain this character. For general notes on this, see my Guide to using special characters in HTML.
An inappropriate way that has worked on some faulty browsers is to set font to Symbol in a manner generally understood by browsers, e.g. <font face=Symbol>Â</font> or <span style="font-family: Symbol">Â</span>. But as said, if this “works”, consider it a browser bug.
So yes, if you now have data using Symbol font, it should be mapped to Unicode characters.
Note that characters like “ℜ” (Black-letter capital R, not script R) are seldom needed. In particular, the standard (as per ISO 80000-2) notation for the real part of a complex number z is not ℜ(z) but Re z.
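The mapping step could be sketched roughly as follows. This is only an illustration with a deliberately partial table (three entries); the names SYMBOL_TO_UNICODE and symbolToUnicode are hypothetical, and a real conversion would need the complete Symbol-to-Unicode table:

```javascript
// Partial, illustrative table: Symbol font code -> Unicode character.
const SYMBOL_TO_UNICODE = new Map([
  [0x61, "\u03B1"], // "a" in Symbol -> α (Greek small alpha)
  [0x62, "\u03B2"], // "b" in Symbol -> β (Greek small beta)
  [0xC2, "\u211C"], // "Â" in Symbol -> ℜ (black-letter capital R)
]);

// Hypothetical helper: translate text that was authored in the Symbol font.
// Characters without a table entry are passed through unchanged, so a
// complete table is required before relying on the output.
function symbolToUnicode(text) {
  return Array.from(text, (ch) => {
    const code = ch.codePointAt(0);
    return SYMBOL_TO_UNICODE.has(code) ? SYMBOL_TO_UNICODE.get(code) : ch;
  }).join("");
}

console.log(symbolToUnicode("\u00C2")); // ℜ
```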

Ok, just removing mso-symbol and writing font-family:Symbol seems to have worked. However I suspect this is not really best practice... A table for translating symbols into unicode can be found here: http://www.alanwood.net/demos/symbol.html

HTML encoding of Japanese text

I'm making a static HTML page that displays courtesy text in multiple languages. I noticed that if I paste ウェブサイトのメンテナンスの下で into Expression Blend, that text appears the same in the code. I think it's bad for compatibility and should be replaced by proper HTML entities.
I have tried http://www.opinionatedgeek.com/DotNet/Tools/HTMLEncode/encode.aspx but it gives me back the same Japanese text.
Is it correct, from the point of view of browser compatibility, to paste that Japanese right into the source code of an HTML page?
Else, what is the correct HTML encoding of that text? Or, better, is there any tool that I can use to convert non-ASCII characters to HTML entities, possibly online and possibly free?

I think it's bad for compatibility and should be replaced by proper HTML entities.
Quite the opposite actually, your preference should be to not use html entities but rather correctly declare document encoding as UTF-8 and use the actual characters. There are quite a few compelling reasons to do so, but the real question is why not use it since it's a well- and widely supported standard?
Some of those points have been summarised previously:
UTF-8 encodings are easier to read and edit for those who understand what the character means and know how to type it.
UTF-8 encodings are just as unintelligible as HTML entity encodings for those who don't understand them, but they have the advantage of rendering as special characters rather than hard to understand decimal or hex encodings.
[For example] Wikipedia... actually go through articles and convert character entities to their corresponding real characters for the sake of user-friendliness and searchability.

As long as you mark your web-page as UTF-8, either in the HTTP headers or the meta tags, having foreign characters in your web-pages should be a non-issue. Alternately you could encode/decode these strings using the encodeURI/decodeURI functions in JavaScript, although these produce percent-escapes for URLs rather than HTML entities:
encodeURI('ウェブサイトのメンテナンスの下で')
// returns "%E3%82%A6%E3%82%A7%E3%83%96%E3%82%B5%E3%82%A4%E3%83%88%E3%81%AE%E3%83%A1%E3%83%B3%E3%83%86%E3%83%8A%E3%83%B3%E3%82%B9%E3%81%AE%E4%B8%8B%E3%81%A7"
decodeURI("%E3%82%A6%E3%82%A7%E3%83%96%E3%82%B5%E3%82%A4%E3%83%88%E3%81%AE%E3%83%A1%E3%83%B3%E3%83%86%E3%83%8A%E3%83%B3%E3%82%B9%E3%81%AE%E4%B8%8B%E3%81%A7")
//returns ウェブサイトのメンテナンスの下で
If you are looking for a tool to convert a bunch of static strings to unicode characters, you could simply use encodeURI/decodeURI functions from a web-page developer console (firebug for mozilla/firefox). Hope this helps!
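Where an ASCII-only pipeline genuinely requires HTML character references, a converter is short to write. The following is a minimal sketch, not a built-in: toNumericReferences is a hypothetical name, and it only handles non-ASCII characters (it does not escape &, < or >):

```javascript
// Hypothetical helper: replace every non-ASCII code point with a decimal
// numeric character reference, leaving ASCII characters untouched.
function toNumericReferences(text) {
  return Array.from(text, (ch) => {
    const code = ch.codePointAt(0);
    return code > 0x7F ? "&#" + code + ";" : ch;
  }).join("");
}

console.log(toNumericReferences("\u30A6\u30A7\u30D6")); // &#12454;&#12455;&#12502;
```

Iterating with Array.from walks the string by code point, so characters outside the Basic Multilingual Plane produce one reference each rather than two surrogate halves.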

HTML entities are only useful if you need to represent a character that cannot be represented in the encoding your document is saved in. For example, ASCII has no specification for how to represent "€". If you want to use that character in an ASCII encoded HTML document, you have to encode it as &euro; or not use it at all.
If you are using a character encoding for your document that can represent all the characters you need though, like UTF-8, there's no need for HTML entities. You simply need to make sure the browser knows what encoding the document is in so it can interpret it correctly. This is really the preferable method, since it simply keeps the source code readable. It really makes no sense to want to work with HTML entities if you can simply work with the actual characters.
See http://kunststube.net/frontback for some more information.

A list of Shapes and their escaped codes

I need a list of HTML escaped codes that represents shapes. I want the codes to be compatible with all browsers. For example, I want an escaped code to show a square on the page, in another page I want a triangle. From what I've heard it's possible.

If you're talking about characters that look like basic geometrical shapes, try opening charmap on Windows, or a similar tool if you're on a different OS. As long as your HTML is using a Unicode encoding (like UTF-8), you can simply use those characters directly in your source code. If your code is written in something like ASCII, then you can use the numeric value of the character with a numeric entity, like this: &#9650; (will render a ▲). Keep in mind that the character will only render if the font you're using to display it contains the character.
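As a rough illustration, here are a few characters from the Unicode Geometric Shapes block (U+25A0 to U+25FF) with their decimal references. The SHAPES table and shapeReference helper are hypothetical names used only for this sketch:

```javascript
// A small selection from the Geometric Shapes block.
const SHAPES = {
  square: 0x25A0,   // ■ black square
  triangle: 0x25B2, // ▲ black up-pointing triangle
  diamond: 0x25C6,  // ◆ black diamond
  circle: 0x25CF,   // ● black circle
};

// Hypothetical helper: build the decimal character reference for a shape.
function shapeReference(name) {
  if (!Object.prototype.hasOwnProperty.call(SHAPES, name)) {
    throw new Error("Unknown shape: " + name);
  }
  return "&#" + SHAPES[name] + ";";
}

console.log(shapeReference("triangle")); // &#9650;
console.log(shapeReference("square"));   // &#9632;
```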

Are unicode characters better or more semantic than the simple text versions?

When I copy/paste text from most sites and pdfs, the following characters are almost always in the unicode equivalent:
double quote: " is “ and ” (&ldquo; and &rdquo;)
single quote: ' is ‘ and ’ (&lsquo; and &rsquo;)
ellipsis: ... is … (&hellip;)
I understand ones that can't be represented without unicode like © and ¢, but even for those, I wonder.
When should you use these unicode equivalents? Are they more semantic than not using them? Are they better interpreted by devices (copy/paste/print)? I always find it annoying getting those quote and ellipsis characters because with textmate + programming, you don't use them.

When should you use these unicode equivalents? Are they more semantic than not using them?
Note that these are not “unicode equivalents”. Those characters are available in many character sets other than Unicode, and they are strictly distinct from the alternatives that you propose.
In typography, the left and right versions of the single and double quotation marks are correct. They provide the traditional appearance for those characters that has been used in print media for many years. The ellipsis character provides the correct spacing for an ellipsis that does not naturally occur when using consecutive full stop characters. So the reason all of these are used is to make the text appear correctly to human readers.
Are they better interpreted by devices (copy/paste/print)?
Any system that uses any character set should be designed to correctly handle that character set. If the text is encoded in Unicode, then any recent system (from the last 15 years at least) should be able to handle it, since Unicode is the de facto standard character set for all modern systems.
Not all Unicode-conformant systems will be able to display all characters correctly. This will depend on the fonts available, and even the rendering system that uses the fonts. But any Unicode-conformant system will be able to transmit the characters unaltered (such as in a copy and paste operation).
I always find it annoying getting those quote and ellipsis characters because with textmate + programming, you don't use them.
It is unusual to copy English (or whatever language) text directly into a program without having to add separate delimiters to that text. But most modern programming languages will not have any difficulty handling the text once it is properly delimited.
Any systems that cannot handle Unicode correctly should be updated. Legacy character encodings will have no place in the future.

I think there's a simple explanation: MS Word converts these characters/sequences automatically as you type, and a lot of text on the internet has been copied from this text editor.
Most of the articles I get for my site from other authors are sent as .doc files and I have to convert them. Usually, they contain the characters you've mentioned.
I'd also add one more: many different types of dashes instead of the hyphen. And also the low opening double quote (as seen in some European languages).
I usually let them stay in the text (all my pages are unicode). It's just important to remember it when playing around with regex etc (especially the dashes can be tricky and hard to spot).
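One way to keep regex work manageable is to match against a normalised copy of the text. The following is only a sketch under that assumption: the ASCII stand-ins are a choice made for this example, and normalizePunctuation is a hypothetical name:

```javascript
// Typographic characters and the ASCII stand-ins chosen for this sketch.
const ASCII_EQUIVALENTS = [
  [/[\u2018\u2019\u201A]/g, "'"],  // single quotes, low single quote
  [/[\u201C\u201D\u201E]/g, '"'],  // double quotes, low double quote
  [/[\u2010-\u2015\u2212]/g, "-"], // hyphens, dashes, minus sign
  [/\u2026/g, "..."],              // horizontal ellipsis
];

// Apply each replacement in turn and return the normalised copy.
function normalizePunctuation(text) {
  return ASCII_EQUIVALENTS.reduce(
    (result, [pattern, replacement]) => result.replace(pattern, replacement),
    text
  );
}

console.log(normalizePunctuation("\u201CWait\u2026\u201D \u2013 she said"));
// "Wait..." - she said
```

The original text can stay untouched for display; only the copy used for searching is normalised.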

HTML entities serve a triple purpose:
Being able to use characters that do not belong to the document character set, e.g., insert a euro symbol in an ISO-8859-1 document.
Escape characters that have a special meaning in HTML, such as angle brackets.
Make it easier to type characters that are not on your keyboard or are not supported by your editor, e.g. a copyright symbol.
Update:
My info is correct but I suspect I've answered the wrong question...

On the web, I would consider that markup adds semantic meaning, content does not. So it doesn't really matter which you use in this context.
Typographers would insist on “ and ”, where programmers don't care and just use regular old quotes ".
The key here is interoperability. There are different encoding schemes. As we've all been victim to, people paste content into an editor from Word, which uses windows-1252 encoding. When you serve this content up via AJAX it usually breaks because AJAX uses UTF-8 encoding by default.
Office 2010 now allows for the saving of documents in UTF-8 format. Also, databases have different unicode encoding schemes. The best bet is to use UTF-8 end-to-end.

When you copy-pasta text that includes special characters, they will be left as they are. This is perfectly fine if the characters match the charset used by the webpage.
HTML entities are just a convenience for producing specific characters in any character set. Keyboards tend not to have keys to get symbols like ©, so the HTML entity is a shortcut.
I'm going to generalize and say that most of the time the content is UTF-8 (please correct me if I'm wrong). The copied characters are usually copied correctly and everything works great. If they aren't copied correctly, or the charset is subject to change, or you're after i18n support, go with the HTML or XML entities. Otherwise, leave them as they are; the browser will display them just fine.

Ruby HTML unicode to actual characters

I am trying to convert HTML numeric character references to a string.
Example:
&#12452;&#12473; &#12471;&#12540;&#12488; &#26885;&#23376;
To the symbols they represent (sorry if this doesn't render properly for you):
イス シート 椅子
I've tried the following: CGI::unescapeHTML(str) but I still see the numeric character codes rather than the symbols.
I've tried writing the output to a file (just in case it's simply not rendering properly in the terminal) and opening it with TextEdit/vim but that hasn't helped.

You could use the htmlentities gem. There is also the hex notation to consider (e.g. &#12452; is the same as &#x30A4; or "イ"). There's no good reason to do this by hand (and probably miss various edge cases and notations that you might not be aware of) when there is a complete and tested library that will do it for you.
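To illustrate the two notations such a library has to handle, here is a minimal sketch written in JavaScript rather than Ruby. It is not how the htmlentities gem is implemented, it ignores named entities, and decodeNumericReferences is a hypothetical name:

```javascript
// Hypothetical helper: decode decimal (&#12452;) and hex (&#x30A4;)
// numeric character references; everything else is left as-is.
function decodeNumericReferences(text) {
  const pattern = /&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/g;
  return text.replace(pattern, (match, hex, dec) => {
    const code = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave values beyond the Unicode range untouched rather than throwing.
    return code <= 0x10FFFF ? String.fromCodePoint(code) : match;
  });
}

console.log(decodeNumericReferences("&#12452;&#x30B9;")); // イス
```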
