Character encoding in HTML file using WebView in JavaFX - html

I have a local HTML file that I would like to display in a WebView, in JavaFX. It's actually an html file from an epub file. I'm essentially trying to build my own epub viewer.
The epub's html file displays some text with diacritic marks. Most of these have been handled in the ebook files using html tags and a CSS, but not all. For example, the character "á" is used. When I open the html file in Chrome, it displays normally, but it shows up in my WebView program as "á".
I assume it's a character encoding thing. If I use the character value a&#769, then it shows up properly, but I'd rather not have to go through all the epub files I want to display and see what other characters don't work properly.
I have saved the html file with UTF-8 encoding, and anyway, it's the same file that is being read by Chrome and my program. Any suggestions?

Well, that wasn't too long. Explaining the question put me on the path to salvation :)
I just needed to change Eclipse's encoding, using this answer:
How to support UTF-8 encoding in Eclipse
Window > Preferences > General > Content Types, set UTF-8 as the default encoding for all content types.
Window > Preferences > General > Workspace, set "Text file encoding" to "Other : UTF-8".

Related

How to find all this kind of UNICODE � in multiple html pages (or, with notepad++)? [duplicate]

I have a bizarre problem: Somewhere in my HTML/PHP code there's a hidden, invisible character that I can't seem to get rid of. By copying it from Firebug and converting it I identified it as  or 'Zero width no-break space'. It shows up as non-empty text node in my website and is causing a serious layout problem.
The problem is, I can't get rid of it. I can't see it in my files even when turning Invisibles on (duh). I can't seem to find it, no search tool seems to pick up on it. I rewrote my code around where it could be, but it seems to be somewhere deeper in one of the framework files.
How can I find characters by charcode across files or something like that? I'm open to different tools, but they have to work on Mac OS X.
You don't get the character in the editor, because you can't find it in text editors. #FEFF or #FFFE are so-called byte-order marks. They are a Microsoft invention to tell in a Unicode file, in which order multi-byte characters are stored.
To get rid of it, tell your editor to save the file either as ANSI/ISO-8859 or as Unicode without BOM. If your editor can't do so, you'll either have to switch editors (sadly) or use some kind of truncation tool like, e.g., a hex editor that allows you to see how the file really looks.
On googling, it seems, that TextWrangler has a "UTF-8, no BOM" mode. Otherwise, if you're comfortable with the terminal, you can use Vim:
:set nobomb
and save the file. Presto!
The characters are always the very first in a text file. Editors with support for the BOM will not, as I mentioned, show it to you at all.
If you are using Textmate and the problem is in a UTF-8 file:
Open the file
File > Re-open with encoding > ISO-8859-1 (Latin1)
You should be able to see and remove the first character in file
File > Save
File > Re-open with encoding > UTF8
File > Save
It works for me every time.
It's a byte-order mark. Under Mac OS X: open terminal window, go to your sources and type:
grep -rn $'\xFEFF' *
It will show you the line numbers and filenames containing BOM.
In Notepad++, there is an option to show all characters. From the top menu:
View -> Show Symbol -> Show All Characters
I'm not a Mac user, but my general advice would be: when all else fails, use a hex editor. Very useful in such cases.
See "Comparison of hex editors" in WikiPedia.
I know it is a little late to answer to this question, but I am adding how to change encoding in Visual Studio, hope it will be helpfull for someone who will be reading this sometime:
Go to File -> Save (your filename) as...
And in File Explorer window, select small arrow next to the Save button -> click Save with Encoding...
Click Yes (on Do you want to replace existing file dialog)
And finally select e.g. Unicode (UTF-8 without signature) - that removes BOM

utf-8 / utf-16 conversion

When I design a html page in Dreamweaver CS6 I use its validation tool (it sends the code to w3c) and I get no errors. However, when I validate the same page in UltraEdit 21 (it uses HTML Tidy) I get the warning:
"Specified input encoding (utf-8) does not match actual input encoding (utf-16)"
The page is set as html5 (with <!doctype html>), as utf-8 (with <meta charset="utf-8">) and contains greek text.
Well, the question is:
Does that problem affect the appearance of the page? I mean, when I publish it, will a user in China, Germany, or ...Tierra del Fuego see the greek text?
If yes, the rest are less important, but I'll ask them:
What makes HTML Tidy to define the document as utf-16? Is there a character, word or visible string of any kind that I can remove/delete to correct the problem?
If I use <meta charset="utf-16"> will browsers parse the code correctly (ending to greek text for the global user)?
The actual file encoding will be set in Dreamweaver properties for the file.
Dreamweaver Help / Set title and encoding properties for a page:
The Title/Encoding Page Properties options let you specify the document encoding type that is specific to the language used to author your web pages as well as specify which Unicode Normalization Form to use with that encoding type.
Select Modify > Page Properties, or click the Page Properties button in the text Property inspector.
Choose the Title/Encoding category and set the options.
...
Encoding
Specifies the encoding used for characters in the document.
If you select Unicode (UTF‑8) as the document encoding, entity encoding is not necessary because UTF‑8 can safely represent all characters. If you select another document encoding, entity encoding may be necessary to represent certain characters. For more information on character entities, see www.w3.org/TR/REC-html40/sgml/entities.html.
...
Include Unicode Signature (BOM)
Includes a Byte Order Mark (BOM) in the document. A BOM is 2 to 4 bytes at the beginning of a text file that identifies a file as Unicode, and if so, the byte order of the following bytes. Because UTF‑8 has no byte order, adding a UTF‑8 BOM is optional. For UTF‑16 and UTF‑32, it is required.
Choose UTF-8 without BOM.
UltraEdit automatically detects encoding of a file on opening and displays it at bottom in status bar. See in UltraEdit Advanced - Configuration - File Handling - Unicode/UTF-8 Detection and press button Help for some more details.
UTF-16 is displayed for a file encoded in UTF-16 Little Endian with or without BOM on using standard status bar since UE v19.00. Clicking on this list box in status bar and selecting Unicode - UTF-8 results in converting the file from UTF-16 LE to UTF-8 which then matches with the character set declaration in head of your HTML5 file.
When using basic status bar in UE v19.00 or any later version or using any UltraEdit version prior v19.00, the status bar field right to the field with line, column and clipboard number starts with U- for a file with UTF-16 LE encoding.
The UltraEdit help page about the Status Bar contains more information about information shown in standard and basic status bar in UltraEdit.
Conversion to UTF-8 can be done with UltraEdit also with command UNICODE/UTF-8 to UTF-8 (Unicode Editing) in submenu Conversions in menu File.
There are 2 configuration settings at Advanced - Configuration - File Handling - Save which define saving a UTF-8 encoded file with or without byte order mark (BOM):
Write UTF-8 BOM header to all UTF-8 files when saved
Write UTF-8 BOM on new files created within this program (if above is not set)
As UTF-8 encoded HTML files should be always without BOM, it is better to have both UTF-8 BOM settings unchecked when using UltraEdit mainly for editing HTML files.
Another possibility to convert a file with UltraEdit is using command Save As from menu File and use appropriate Encoding / Format setting. UTF-8 in Save As dialog means saving the file as UTF-8 encoded file with BOM and UTF-8 - NO BOM without BOM independent on the two configuration settings for standard Save.
For converting all files in a single folder, a folder tree, opened in UltraEdit, etc. to UTF-8 using UltraEdit, there is an UltraEdit scripting solution, see How to convert all files in a folder to UTF-8?
Unfortunately UE v21.30.0.1024 still does not recognize the short character set declaration <meta charset="utf-8"> as defined in HTML5 standard. See Short utf-8 charset declaration in HTML5 header with details about this limitation and how it can be worked around. This limitation does not matter if within first 64 KB at least one UTF-8 encoded character is found as it will be the case for your HTML5 files with Greek text.
HTML Tidy installed with UltraEdit v21.30.0.1024 is of version 25 March 2009. I'm not sure if HTML Tidy really supports short charset declaration of HTML5. But it looks so because otherwise you would not see the warning on validating the HTML5 file with HTML Tidy.
It might be useful for you to read UltraEdit power tip Unicode text and Unicode files in UltraEdit/UEStudio as it looks like you do not really know what encoding and character set really means and why it is important for applications that the declaration in the HTML5 matches with really used encoding.
I answer your questions now after all those general UltraEdit stuff.
Does that problem affect the appearance of the page?
Although the file contains the declaration that file contents is encoded with UTF-8, but is in real encoded with UTF-16 Little Endian, the browsers display the contents correct. UTF-16 detection is very easy, especially with BOM present and therefore browsers ignore wrong declaration and interpret the bytes of the HTML file from beginning right as UTF-16 encoded text file.
However, it would be much better to convert the UTF-16 encoded HTML files to UTF-8 without BOM. UTF-8 without BOM is most commonly used for HTML files worldwide and then the character set declaration in head of your HTML file would also match with really used encoding.
What makes HTML Tidy to define the document as utf-16?
The really used encoding of your HTML file is UTF-16 Little Endian and UltraEdit, HTML Tidy and the browsers detect that already after reading in the first 2 bytes of the text file - the byte order mark. That's the reason why HTML Tidy suggests to declare the encoding in head of HTML file correct as utf-16 as the file is really encoded with.
If I use <meta charset="utf-16"> will browsers parse the code correctly?
In case of keeping the file encoded in UTF-16 LE (always 2 bytes per character), it would be better to declare the character set right with <meta charset="utf-16">. But no Unicode aware text editor or browser has a problem to automatically detect UTF-16 Little Endian encoding with byte order mark.
The character set declaration becomes very important mainly for UTF-8 encoded files (1, 2, 3 or even 4 bytes per character) or files with single-byte coded characters using a code page like Windows-1252 / ISO 8859-1 (Latin 1) or Windows-1253 / ISO 8859-7 (Latin/Greek).

How to render an HTML file offline?

I have a collection of html files that I gathered from a website using wget. Each file name is of the form details.php?id=100419&cid=13%0D, where the id and cid varies. Portions of the html files contain articles in Asian language (Unicode text). My intention is to extract the Asian-language text only. Dumping the rendered html using a command-line browser is the first step that I have thought of. It will eliminate some of the frills.
The problem is, I cannot dump the rendered html to a file (using, say, w3m -dump ). The dumping works if only I direct the browser (at the command-line) to the properly formed URL : http://<blah-blah>/<filename>. But this is way I will have to spend the time to download the files once again from the web. How do I get around this, what other tools could I use?
w3m -dump <filename> complains saying:
w3m: Can't load details.php?id=100419&cid=13%0D.
file <filname> shows:
details.php?id=100419&cid=13%0D: Non-ISO extended-ASCII HTML document text, with very long lines, with CRLF, CR, LF, NEL line terminators

Chinese text not displaying properly on web page

I am adding some Chinese text to a primarily English web page and am having trouble getting the characters to display properly. I've got the encoding set to UTF-8 in the meta content type tag, and I am copying/pasting the Chinese I was sent from a Word document. The text is still rendering as follows:
繁體中文版
rather than in Chinese characters:
繁體中文版
I'm sure it's an easy fix, but I'm lost as to how to make this happen.
Thanks very much for any help.
just because the meta tag says that the encoding is UTF8, doesn't mean that the content (file) itself is in UTF8. I mean, if you have a file index.html, the file itself should be encoded as utf8.
To change the encoding of a file in lunix, you can use this command
iconv --from-code=ISO-8859-1 --to-code=UTF-8 ./index.html > ./newIndex.html
but i guess that you are working with windows... and the only way i know change the encoding in windows is the Notepad++
Hope this helps

Eclipse HTML editor for HTML template files

I'm trying to edit phpbb HTML template file with Eclipse Ganymedes version 3.4.1 containing Web Developer Tools.
These template files contain HTML markup with template variable marks in form {variable_name}. Now, when trying to open such file, Eclipse trys to validate also these template variable marks.
For example template contains
<meta http-equiv="content-type" content="text/html; charset={S_CONTENT_ENCODING}" />
After opening Eclipse shows on editor body:
Unsupported Character Body
Character encoding "{S_CONTENT_ENCODING}" is not supported by this platform.
<button>Set encoding...</button>
How to solve this using WTP or is there any better editor for template editing purpose ?
Eclipse is trying to determine the text encoding from your meta tags and fails.
To override this behavior open the file in eclipse so you can see the error. Open the File menu and choose Properties (Alt-Enter) and eclipse will show you the properties dialog for the file where you can change the text file encoding.
I don't know if this can be disabled for all the files.
I've never used Eclipse on Linux, but it looks like the problem isn't really about Eclipse supporting variables -- it's about it trying to render what a character set that it thinks is called "{S_CONTENT_ENCODING}"
You can probably get around the problem by changing {S_CONTENT_ENCODING} to utf-8 (or latin-1 or whatever) in all of your templates. (This assumes that you aren't changing encoding from one template to the next, but I really doubt you are.)
Copy-paste utf-8 where you see {S_CONTENT_ENCODING} in one of the templates, and Eclipse should handle it the other {foo} instances from there.