ignore zero width non joiner in hunspell affix file - hunspell

I added this line to my affix file. But it has no effect on spell checker.
IGNORE <U-200C>
where <U-200C> is the UTF-8 encoded U+200C (ZWNJ) character.
I found this suggestion here...
https://bugs.documentfoundation.org/show_bug.cgi?id=60427
What is the correct way to ignore zero width non joiner in hunspell affix file?

From my research, what follows IGNORE and a space must be the character itself, encoded in UTF-8. It is not a textual representation of the character, but the character itself.
There may be other possible valid representations in the file.
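Because ZWNJ is invisible, it is hard to confirm by eye that the character really is in the file. A minimal sketch of producing the line from an escape sequence, so the editor never has to type the character; it assumes the affix file is UTF-8 encoded and declares SET UTF-8, and it does not confirm that this alone fixes the behaviour described in the question:

```python
# U+200C ZERO WIDTH NON-JOINER, written as an escape because it is invisible.
ZWNJ = "\u200c"

# The affix line: the keyword, one space, then the literal character.
ignore_line = f"IGNORE {ZWNJ}\n"

# Encoded as UTF-8, the character becomes the three bytes E2 80 8C.
encoded = ignore_line.encode("utf-8")
print(encoded)  # b'IGNORE \xe2\x80\x8c\n'
```

Appending these bytes to the affix file (opened in binary mode) gives the literal character rather than a placeholder such as <U-200C>.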

Related

I get the wrong character, even using HTML encoding

On a website I use the futura font. I use some french language text, so I need the "à" character, amongst others. I use UTF-8 charset.
Weirdly, the "à" shows up as an r with an accent on top (see the pic)
I tried HTML encoding:
à
But the result is the same. Is there something I can do about it?
The question does not give enough information, but the probable explanation is that the HTML document is encoded in ISO-8859-1 rather than UTF-8, and the browser is interpreting it as ISO-8859-2. The letter “à” has the code E0 (hexadecimal) in ISO-8859-1; in ISO-8859-2, this code denotes the letter “ŕ”.
How to fix this? It depends on how the problem was created, especially how the character encoding is declared (or guessed by browsers). See
https://www.w3.org/International/questions/qa-html-encoding-declarations .
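The byte-level explanation above can be checked with a short sketch. It only illustrates the E0 mismatch; it says nothing about how the page in the question is actually served:

```python
# The single byte E0, as it would be stored in an ISO-8859-1 file.
raw = bytes([0xE0])

# The same byte read under two different encodings.
as_latin1 = raw.decode("iso-8859-1")  # a with grave accent
as_latin2 = raw.decode("iso-8859-2")  # r with acute accent

# In a file that really is UTF-8, the letter takes two bytes instead.
as_utf8 = "\u00e0".encode("utf-8")

print(as_latin1, as_latin2, as_utf8)
```

If the file on disk contains the single byte E0, it is not UTF-8, whatever the page declares.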

Can I use HTML-entities without a fallback?

I am wondering, if I can use html-entities like
<h5><em>⇆</em> Headline</h5>
without any fallback if I use utf-8? (because on my systems this works totally fine). Are all these chars from http://dev.w3.org/html5/html-author/charref really all embedded into the utf-8-charset by default?
And how would I use it correctly, like this:
<h5><em>⇆</em> Headline</h5>
that
<h5><em>&lrarr; </em> Headline</h5>
or
<h5><em>⇆</em> Headline</h5>
There are two separate issues here:
get the browser to understand which character you want
render that character visually
For the first point, there are two options:
Embed the character directly as is, for which you will need to serve the HTML in an encoding that can encode that character. Yes, "⇆" is a Unicode character and can be encoded by any Unicode encoding. UTF-8 is the best choice here. The browser then simply needs to understand that the document is encoded in UTF-8 and it will be able to read and understand the character correctly. Set the appropriate HTTP header to denote the encoding.
Embed the character as an HTML entity. HTML entities are a way to embed any arbitrary character using only ASCII characters, e.g. &lrarr;. To encode this, your encoding of choice only needs to be able to encode &, l, r, a and ;, which are very standard characters in any encoding. This special sequence of characters is understood by the browser to mean the character "⇆". By embedding characters as HTML entities you can largely ignore the intricacies of managing encodings correctly, but it makes your source code rather unreadable. You should not do this in this day and age.
Whether you use a named entity (&lrarr;) or a numeric character reference to the character's Unicode code point (&#8646; or &#x21C6;) doesn't really matter; both result in the same thing.
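As a rough check that the named and numeric forms are equivalent, Python's standard html module can resolve them the way an HTML5 parser would. This is only a sketch of the equivalence, not of how any particular browser is implemented:

```python
import html

# Three spellings of the same character, all pure ASCII in the source.
named = html.unescape("&lrarr;")
decimal = html.unescape("&#8646;")
hexadecimal = html.unescape("&#x21C6;")

# All three resolve to U+21C6.
print(named == decimal == hexadecimal)  # True
print(hex(ord(named)))                  # 0x21c6
```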
Having handled this, the character needs to be actually rendered as a glyph on screen. For this, an appropriate font is necessary. You'll have to test whether most of your target audience uses a system which has a font installed by default which contains this character. You can also provide your own font to the browser which contains this character as a web font.

IBM Extended ASCII Characters in HTML

I'm trying to get special characters into HTML, and am not sure if this is even possible. If anyone remembers Kroz, or just about every DOS interface - there is a special set of shape characters. I'm wanting to use the single braces, double braces, shadows, and other shape characters, but I can't seem to track any of these down anywhere.
Also, will using these characters in an HTML environment present any localization concerns / will there be a required charset?
Thanks!
There is no “extended ASCII”; ASCII ends at code position 127 decimal, 7F hexadecimal. What is called “extended ASCII” is a set of mutually incompatible 8-bit encodings that contain the printable ASCII characters in the same positions as in ASCII. In your case, you seem to want to use the Code Page 437. All of its characters exist in Unicode. You can find the correspondence at
http://en.wikipedia.org/wiki/Code_page_437
which I believe to be correct in this issue; but the authoritative reference is
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT
There are various ways to enter the characters. You can use, say, “▓” as such in HTML, if you have some way of entering it and you use UTF-8 on the page. Alternatively, you can use character references like &#x2593;.
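One way to look up the correspondence without reading the mapping table by hand is to decode the old byte values with a CP437 codec. A small sketch using Python's built-in cp437 codec; the four byte values are arbitrary examples chosen here, not taken from the question:

```python
# Four Code Page 437 byte values: three double-line box pieces and a shade.
cp437_bytes = bytes([0xC9, 0xCD, 0xBB, 0xB2])

# Decoding maps each byte to its Unicode character.
text = cp437_bytes.decode("cp437")

# The same characters as hexadecimal character references for HTML.
references = "".join(f"&#x{ord(ch):X};" for ch in text)

print(text)
print(references)  # &#x2554;&#x2550;&#x2557;&#x2593;
```

Either form can go into the page: the decoded characters directly if the page is saved and served as UTF-8, or the references in any ASCII-compatible encoding.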
Yes, similar characters exist in Unicode and can be encoded in UTF-8. These are called box drawing characters.
See: http://www.fileformat.info/info/unicode/block/box_drawing/utf8test.htm

HTML Unicode Issue: How to display special characters

Currently, I have my webpage set to Unicode/UTF-8. When trying to display a special character (for example, em dash, double arrow, etc), it shows up as a question mark symbol. I cannot change these characters to the HTML entity equivalent. How can I circumvent this issue?
A question mark in a lozenge, �, indicates a character-level error: the data contains bytes that do not represent any character, according to the character encoding being applied. This typically happens when the document is declared as UTF-8 encoded but is really in iso-8859-1, windows-1252, or some similar encoding. Windows-1252 is a common default encoding used by various programs on Windows platforms. So you may need to open the file in your authoring program and re-save it as UTF-8 encoded.
If problems remain, please post the URL. Posting the code alone is not sufficient, since the character encoding is primarily specified in HTTP headers.
If you see a question mark in a small box, then it might be a font-level problem (lack of glyph in the fonts being used), but this would be very rare for common characters like the em dash. Different browsers have different ways of indicating character- or font-level problems.
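The character-level error described above can be reproduced in a few lines. This sketch assumes the file was saved as windows-1252 and then read as UTF-8, which is the typical case named in the answer but is not confirmed for the page in the question:

```python
# An em dash, written as an escape, between two ASCII letters.
em_dash = "\u2014"
sample = f"a{em_dash}b"

# Saved by an editor that defaults to windows-1252: the dash is one byte, 0x97.
cp1252_bytes = sample.encode("cp1252")

# Read back as UTF-8: 0x97 is not valid there, so U+FFFD is substituted.
misread = cp1252_bytes.decode("utf-8", errors="replace")

# Re-saved as UTF-8, the text round-trips intact.
resaved = sample.encode("utf-8").decode("utf-8")

print(cp1252_bytes)         # b'a\x97b'
print(misread == "a\ufffdb")  # True
print(resaved == sample)      # True
```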
Make sure your document is set to the correct character encoding in the actual code editor, as well as declared in the document itself (the meta charset element in the head). Both are necessary. I spent hours trying to tweak HTML when the only problem was that I needed to set the text setting in Coda.
<head>
<meta charset="utf-8">
See the following screenshot:
Make sure your characters are actually UTF-8 characters. They will probably look something like this:
&#174; or U+0020
http://www.kinsmancreative.com/transfer/char/index.php is a handy site for finding the decimal values of commonly used UTF-8 special characters if you need a reference.

Problem with character encoding of '

Why does this page
http://reboltutorial.com/cgi-bin/designpatterns-quiz.cgi
show a ? in the sentence (at least on my Firefox)
Separates an object’s abstraction from its implementation
whereas I am using utf-8 ?
Because the site declares UTF-8 as its content type, but the ´ is an ISO-8859-1 encoded character. Switch your editor to UTF-8 and type it in again (this is the recommended way over using entities).
Because whoever made that page actually encoded it iso-8859-1.
Just saying you're using UTF-8 doesn't make it so.
Where does that character come from? Is it static content, or does it come from a database or so? If it's from the database, make sure your entire web application stack uses UTF-8 as well. I don't know what you're using, but I know it's not straight-forward for PHP/Apache/MySQL.
If it's static content, make sure you save your HTML files as UTF-8.
Contrary to what the others have said, by the way, that character is not an ISO-8859-1 character, but a Windows-1252 character.
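The last point can be verified directly, assuming the character in question is the right single quotation mark, U+2019, as it appears in the quoted sentence. A minimal sketch:

```python
# The typographic apostrophe from the quoted sentence, as an escape.
quote = "\u2019"

# Windows-1252 has a slot for it: the single byte 0x92.
cp1252_byte = quote.encode("cp1252")

# ISO-8859-1 has no such character, so encoding fails.
try:
    quote.encode("iso-8859-1")
    in_latin1 = True
except UnicodeEncodeError:
    in_latin1 = False

# Served as UTF-8, the lone byte 0x92 is invalid and shows as U+FFFD.
misread = cp1252_byte.decode("utf-8", errors="replace")

print(cp1252_byte)  # b'\x92'
print(in_latin1)    # False
```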