Ruby HTML unicode to actual characters - html

I am trying to convert HTML numeric character references to a string.
Example:
&#12452;&#12473; &#12471;&#12540;&#12488; &#26885;&#23376;
To the symbols they represent (sorry if this doesn't render properly for you):
イス シート 椅子
I've tried the following: CGI::unescapeHTML(str) but I still see the numeric character codes rather than the symbols.
I've tried writing the output to a file (just in case it's simply not rendering properly in the terminal) and opening it with TextEdit/vim but that hasn't helped.

You could use the htmlentities gem. There is also the hex notation to consider (e.g. &#12452; is the same as &#x30A4; or "イ"). There's no good reason to do this by hand (and probably miss various edge cases and notations that you might not be aware of) when there is a complete and tested library that will do it for you.
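For readers outside Ruby, the decimal/hex equivalence can be checked with Python's standard library. This is only an illustrative sketch (it does not show the htmlentities gem), and the input is the numeric references corresponding to the sample text in the question:

```python
import html

# Decimal references for the question's sample text, plus the hex spelling
# of its first character.
decimal_refs = "&#12452;&#12473; &#12471;&#12540;&#12488; &#26885;&#23376;"
hex_ref = "&#x30A4;"

decoded = html.unescape(decimal_refs)
print(decoded)                  # イス シート 椅子
print(html.unescape(hex_ref))   # イ

# Decimal and hex notation name the same code point (U+30A4).
assert html.unescape("&#12452;") == html.unescape("&#x30A4;") == "\u30a4"
```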

Related

Custom google font error in HTML

I have a blog where I use custom fonts from Google Fonts in each and every text of the <body> element, but whenever there is an inverted comma or a double inverted comma in my text, it is not shown as it should be - it is replaced by an unknown character.
I had even looked into the font and there is the character support for the inverted commas.
I don't think this has anything to do with your font.
If you look at the source code you will see the characters already are broken there:
Rather, this is an encoding problem. Your site is declared as UTF-8, but the characters seem to be stored in a non-UTF-8 encoding. You either need to use UTF-8 characters or change the encoding of your site. (The first option is preferable.)
If you change the site encoding to Windows-1252 (which is automatically suggested by Chrome based on the content) everything seems fine:
The question is how you created this text. Maybe in Word, and then copied and pasted it? Or is your blog backend not UTF-8?
Also note there are two different characters: ’ vs ´.
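The mismatch described above can be reproduced in a few lines. This is a sketch in Python, assuming the curly apostrophe was saved as Windows-1252 (the answer's guess, not a confirmed fact about the blog):

```python
# A right single quotation mark (U+2019), as word processors often insert it.
text = "don\u2019t"

# Saved as Windows-1252, the apostrophe is the single byte 0x92 ...
cp1252_bytes = text.encode("cp1252")
# ... which is not valid UTF-8, so a UTF-8 reader substitutes U+FFFD,
# the "unknown character" glyph.
misread = cp1252_bytes.decode("utf-8", errors="replace")

# Saved as UTF-8, the same character takes three bytes and round-trips cleanly.
utf8_bytes = text.encode("utf-8")

print(cp1252_bytes)                        # b'don\x92t'
print(misread == "don\ufffdt")             # True
print(utf8_bytes.decode("utf-8") == text)  # True
```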
It's a special character. Please check the example below:
if you want to write "Don't" then you have to use "don&rsquo;t"
if you want to write "highly sought users" in double quotes then you have to use &ldquo;highly sought users&rdquo;
I hope this will help you.
Usually the special characters appear when you copy the text from other sources like MS Word. This can be solved by manually typing the inverted commas when entering or modifying the text in the database.

Prevent HTML Entity Conversion to Emoji

I've coded an HTML email and use "&#9654;" to code a right-pointing triangle in place of an image in a call-to-action. This renders as anticipated except on iOS devices, where this HTML entity is converted to its emoji counterpart. I also tried using the hex version instead of the decimal one, with no success.
I've found posts where the solution utilizes php, but as this is an HTML email I can't use PHP.
Any way to prevent iOS from converting the HTML Entity into its emoji counterpart?
Here's the html entity I'm using: http://www.fileformat.info/info/unicode/char/25B6/index.htm
&#9654;&#xFE0E;
U+FE0F and U+FE0E are ‘variation selectors’, signalling that, respectively, an emoji-like (coloured/animated) or text-like rendering is preferred, if available. If neither is used, the renderer can choose at will. Unfortunately iOS in certain scenarios defaults to the emoji variant and has to be manually put right.
(Hex vs decimal character reference is immaterial. You can include the raw characters too; you don't necessarily have to encode them as character or entity references, but as raw characters the existence of the variation selector would be hard to see in an editor.)
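To make the otherwise invisible selector visible, the reference can be built and inspected programmatically. A small Python sketch; the helper name `as_numeric_refs` is made up for illustration:

```python
import html
import unicodedata

def as_numeric_refs(text):
    """Spell every character as a hexadecimal character reference."""
    return "".join(f"&#x{ord(ch):X};" for ch in text)

text_style = "\u25B6\uFE0E"   # triangle + U+FE0E (text presentation requested)
emoji_style = "\u25B6\uFE0F"  # triangle + U+FE0F (emoji presentation requested)

print(as_numeric_refs(text_style))    # &#x25B6;&#xFE0E;
print(as_numeric_refs(emoji_style))   # &#x25B6;&#xFE0F;

# List the code points and their Unicode names so the selector shows up.
for ch in text_style:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Decimal and hex references decode to the same two characters.
assert html.unescape("&#9654;&#xFE0E;") == text_style
```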

HTML encoding of Japanese text

I'm making a static HTML page that displays courtesy text in multiple languages. I noticed that if I paste ウェブサイトのメンテナンスの下で into Expression Blend, that text appears the same in the code. I think it's bad for compatibility and should be replaced by proper HTML entities.
I have tried http://www.opinionatedgeek.com/DotNet/Tools/HTMLEncode/encode.aspx but it returns the same Japanese text.
Is it correct, from the point of view of browser compatibility, to paste that Japanese right into the source code of an HTML page?
Else, what is the correct HTML encoding of that text? Or, better, is there any tool that I can use to convert non-ASCII characters to HTML entities, possibly online and possibly free?
"I think it's bad for compatibility and should be replaced by proper HTML entities."
Quite the opposite actually, your preference should be to not use html entities but rather correctly declare document encoding as UTF-8 and use the actual characters. There are quite a few compelling reasons to do so, but the real question is why not use it since it's a well- and widely supported standard?
Some of those points have been summarised previously:
"UTF-8 encodings are easier to read and edit for those who understand what the character means and know how to type it."
"UTF-8 encodings are just as unintelligible as HTML entity encodings for those who don't understand them, but they have the advantage of rendering as special characters rather than hard to understand decimal or hex encodings."
"[For example] Wikipedia... actually go through articles and convert character entities to their corresponding real characters for the sake of user-friendliness and searchability."
As long as you mark your web page as UTF-8, either in the HTTP headers or the meta tags, having foreign characters in your web pages should be a non-issue. Alternatively, you could encode/decode these strings using the encodeURI/decodeURI functions in JavaScript:
encodeURI('ウェブサイトのメンテナンスの下で')
// returns "%E3%82%A6%E3%82%A7%E3%83%96%E3%82%B5%E3%82%A4%E3%83%88%E3%81%AE%E3%83%A1%E3%83%B3%E3%83%86%E3%83%8A%E3%83%B3%E3%82%B9%E3%81%AE%E4%B8%8B%E3%81%A7"
decodeURI("%E3%82%A6%E3%82%A7%E3%83%96%E3%82%B5%E3%82%A4%E3%83%88%E3%81%AE%E3%83%A1%E3%83%B3%E3%83%86%E3%83%8A%E3%83%B3%E3%82%B9%E3%81%AE%E4%B8%8B%E3%81%A7")
// returns "ウェブサイトのメンテナンスの下で"
If you are looking for a tool to convert a bunch of static strings to unicode characters, you could simply use encodeURI/decodeURI functions from a web-page developer console (firebug for mozilla/firefox). Hope this helps!
HTML entities are only useful if you need to represent a character that cannot be represented in the encoding your document is saved in. For example, ASCII has no specification for how to represent "€". If you want to use that character in an ASCII encoded HTML document, you have to encode it as &euro; or not use it at all.
If you are using a character encoding for your document that can represent all the characters you need though, like UTF-8, there's no need for HTML entities. You simply need to make sure the browser knows what encoding the document is in so it can interpret it correctly. This is really the preferable method, since it simply keeps the source code readable. It really makes no sense to want to work with HTML entities if you can simply work with the actual characters.
See http://kunststube.net/frontback for some more information.
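On the tool question: if the file really must be saved in an encoding that lacks some characters, a short script can produce the references. A Python sketch using the standard `xmlcharrefreplace` error handler (it emits decimal references such as `&#8364;`, not named entities such as `&euro;`); `to_ascii_html` is an illustrative helper name:

```python
import html

def to_ascii_html(text):
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

print(to_ascii_html("Price: 5 €"))   # Price: 5 &#8364;

japanese = "ウェブサイトのメンテナンスの下で"
encoded = to_ascii_html(japanese)

# Decoding the references gives the original text back.
assert html.unescape(encoded) == japanese
```

With a UTF-8 document none of this is needed, which is the point the answers above make.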

Browser is HTML Encoding a character before sending it?

I can't believe what I'm seeing here! I have a normal, basic HTML form (I haven't changed the enctype). If someone puts a strange Japanese character in the field and posts the form, then my database saves an HTML-encoded version of the character. I am not processing the string at all except with a Trim(). I'm using classic ASP (not out of choice, I might add!). I have a feeling this might have something to do with UTF-8/encoding, but I've tried messing around with the meta tag and content type and have been unable to get the character to come through properly. To make things harder, I don't seem to be able to get classic ASP debugging in VS Express 2010. Any comments appreciated :)
As you can see in this demo and read in the standard (4.10.22.6.4.2), characters that are not supported by the selected encoding (such as Japanese ones in an ISO8859-* or cp1252 encoding) are encoded as HTML entities.
If you are fine with incorrectly handling user input that contains HTML entities in the clear, you can replace all numeric HTML entities in the user input with the corresponding Unicode characters. However, doing so in ASP is hard, since there is no inverse function to Server.HTMLEncode and Unicode support is pretty much nonexistent in the first place.
As an alternative, use UTF-8 (and/or a web development platform from this millennium) and all these problems go away. However, since that may not be an option, you may want to unescape the HTML entities in a different program, for example with HttpUtility.HtmlDecode in C#, html_entity_decode in PHP, or HTMLParser.unescape in Python (html.unescape in Python 3.4 and later).
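The ambiguity mentioned above (input that contains entities "in the clear") can be made concrete. The following Python sketch only imitates the browser behaviour; `browser_submit` is a hypothetical stand-in for a form posted as Windows-1252:

```python
import html

def browser_submit(text, form_charset="cp1252"):
    """Rough imitation: characters the form charset cannot represent
    are sent as decimal character references."""
    data = text.encode(form_charset, errors="xmlcharrefreplace")
    return data.decode(form_charset)

typed_character = browser_submit("イ")         # user typed the character
typed_literally = browser_submit("&#12452;")   # user typed the reference itself

# Both submissions arrive as the same string, so the server cannot tell
# them apart once it unescapes the input.
print(typed_character)                   # &#12452;
print(typed_literally)                   # &#12452;
print(html.unescape(typed_character))    # イ
```

Posting the form as UTF-8 avoids the problem, because the character is then sent as itself.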

Special HTML Characters

Ok, so I want to have the characters from below in my html page. Seems easy, except I can't find the HTML encoding for them.
Note: I would like to do this without having sized elements, plain ol' text would be fine ^_^.
Cheers.
You can see that they have a unicode number of the selected character - at the bottom of the picture ("U+266A: Eighth Note").
Simply use the last portion in a Unicode character reference: &#x266A; - ♪
If your page is already UTF-8, you can simply paste it in.
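Going from the "U+266A" shown in the dialog to a character reference is mechanical. A Python sketch; `ref_from_codepoint` is an illustrative helper, not a library function:

```python
import html

def ref_from_codepoint(label, decimal=False):
    """Turn a label such as 'U+266A' into '&#x266A;' (or '&#9834;' if decimal)."""
    if not label.upper().startswith("U+"):
        raise ValueError(f"expected a label like 'U+266A', got {label!r}")
    code = int(label[2:], 16)
    return f"&#{code};" if decimal else f"&#x{code:X};"

hex_ref = ref_from_codepoint("U+266A")
dec_ref = ref_from_codepoint("U+266A", decimal=True)
print(hex_ref)   # &#x266A;
print(dec_ref)   # &#9834;

# Both spellings decode to the eighth note.
assert html.unescape(hex_ref) == html.unescape(dec_ref) == "♪"
```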
Try encoding it as &#9608; - that should do the trick!
In a UTF-8 encoded page, just copy and paste them as-is.
Otherwise, use the number that the dialog gives you for each character, e.g. &#x266A;
However, when working with rather exotic characters, be very wary of font support. See e.g. this question for background: Unicode support in Web standard fonts
This page gives some information about support for the characters you want to use. They seem to be relatively well supported, but a test on Linux and Mac machines won't hurt.
Here is one comprehensive entity reference. If you want to convert symbols into their entity counterparts, I suggest using this converter.
My suggestion is to use a hexadecimal reference. (It's easy, don't worry :) )
For example, the first character you have highlighted in red has a character code of 175, which is AF in hex.
So in short, you can encode it using %AF, and so on...
Is it clear, mate? Let me know if you need further explanation or help with this :)
Edit: my post is meant for url encoding.
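Since the edit notes that `%AF` is URL encoding, the contrast with an HTML character reference may be worth spelling out. A Python sketch, assuming the character with code 175 is U+00AF (macron), as in Latin-1:

```python
import html
from urllib.parse import quote, unquote

ch = chr(175)   # U+00AF, assumed to be the character meant by "175"

# URL (percent) encoding depends on the byte encoding chosen.
latin1_form = quote(ch, encoding="latin-1")
utf8_form = quote(ch)                  # UTF-8 is the default
print(latin1_form)                     # %AF
print(utf8_form)                       # %C2%AF

# An HTML character reference names the code point directly.
html_ref = f"&#x{ord(ch):X};"
print(html_ref)                        # &#xAF;

# Percent escapes are not decoded in HTML text; only the reference is.
assert html.unescape("%AF") == "%AF"
assert html.unescape(html_ref) == ch
assert unquote(latin1_form, encoding="latin-1") == ch
```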