I'm having trouble with the encoding of a file, it seems. It's a text file created with vim over SSH on a CentOS server. When viewing the file in a browser, there are problems with its encoding.
I created a test file that demonstrates this behavior:
res.tobscore.com/test.txt
And this is how I want the output to look (this is just an HTML file using special characters to display the umlauts correctly):
res.tobscore.com/test.html
Running the file and cat commands in the terminal produced the following output:
user>file test.txt
test.txt: UTF-8 Unicode English text
user>cat test.txt
This is a testfile. I'm using the German Umlaute and the euro sign, to test
the encoding.
Euro - €
Scharfes S - ß
Ae - Ä
Oe - Ö
Ue - Ü
As you can see, it's UTF-8 Unicode and is displayed correctly. Do you have any suggestion why my browsers (Firefox and Chrome) have trouble displaying it? On my tablet (set up in German), the native browser showed correct results, but Chrome displayed the same garbled output.
Is there a way to set the encoding so that every environment displays the same output?
Your server will most likely send the .txt file as Content-Type: text/plain, but without a character set. Thus, the browser has to pick something (most likely ASCII, ISO-8859-1 or ISO-8859-15) and will display the UTF-8 bytes as garbage.
One workaround is to wrap your text file in a little PHP script and send the correct encoding with it:
<?php
header ('Content-Type: text/plain; charset=utf-8');
readfile ('test.txt');
?>
readfile() will dump the contents of test.txt unaltered to your browser.
Note that it is the webserver that picks the Content-Type based on the extension (.txt); you can probably change that, but you'd have to dig deep into the configuration files.
With UTF-8 text, browsers have a hard time figuring out the encoding used, and will probably default to the system's encoding. Users would have to change the encoding manually (e.g. in Firefox, View > Character Encoding > Unicode (UTF-8) -- not a very workable solution).
One way to fix this is to configure the web server to send the text with the right Content-Type: text/plain; charset=utf-8 metadata (or via PHP, as suggested by JvO).
Or, you could try re-encoding the text file in an encoding that is easier to detect, e.g. UTF-16 with a BOM (Byte Order Mark). In Vim, save the file via:
:setlocal bomb
:w ++enc=utf-16le
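To see why a BOM makes detection easier, here is a minimal PHP sketch (the filename test.txt is reused from the question, nothing else is assumed) that inspects the first bytes of a file for a byte order mark; plain UTF-8 without a BOM has no such marker, which is why the browser ends up guessing:
<?php
// Read the first three bytes and look for a byte order mark.
$head = file_get_contents('test.txt', false, null, 0, 3);

if (substr($head, 0, 2) === "\xFF\xFE") {
    echo "UTF-16LE BOM found\n";
} elseif ($head === "\xEF\xBB\xBF") {
    echo "UTF-8 BOM found\n";
} else {
    echo "No BOM - the encoding has to be guessed\n";
}
?>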
Related
I recently ran an HTML file I was writing through this on-line HTML validator, and one of the diagnostics I got said,
The character encoding was not declared. Proceeding using
"windows-1252".
When I create a webpage, I write it in a text editor, which saves it as DOS-text (with CR-LF line endings). When I upload the file to my web-hosting provider, it gets converted (I think) on the server to Unix text (LF line endings). My text editor can also save files as Unicode including UTF-8, but I rarely find that necessary.
The standard online advice about specifying the character encoding in a web document is to include, just under the <head> tag, <meta charset="utf-8">. There is also advice that you should ensure that what you specify does not conflict with the information sent by the server in the HTTP headers when serving the document. Using Rex Swain's [online] HTTP viewer, I see that in the HTTP headers it just says,
Content-Type: text/html
Should I follow the standard advice to specify the charset as UTF-8, even though the HTML file is never saved as such, or should I specify it as windows-1252, as assumed by that online validator, or as ISO-8859-1, as per one of the example values on W3Schools? Also, some examples of the charset meta tag show it terminated as />. Which is the preferred syntax, and should there be a space before the slash?
I just used tex4ht and htlatex to convert a LaTeX document into HTML, and now I have some serious trouble integrating this HTML document into a web site I'm making (I'm using Laravel, though). I think one of the reasons I have trouble is that the htlatex output files are Unix encoded, not UTF-8. If I just serve the file through Laravel views and controllers without any modification, UTF-8 characters are not displayed, and if I convert the file to UTF-8, all the UTF-8 characters turn weird in Notepad and I have to rewrite them one at a time (the HTML file contains 2000+ lines, so I can't do that). I'm wondering how I can solve the problem. Is putting the input HTML in an iframe tag any good solution? Or is there a way to encode this file to UTF-8 without messing with its content? I'm so lost.
tex4ht uses Latin1 as the default encoding; characters unsupported by this encoding are output as XML entities. You can request UTF-8 output using the following command:
htlatex filename.tex "xhtml,charset=utf-8" " -cunihtf -utf8"
As an alternative, you can use make4ht with the -u option:
make4ht -u filename.tex
make4ht is a replacement for htlatex with many more features.
I ran my web page through the W3C HTML validator and received this error.
The encoding ascii is not the preferred name of the character
encoding in use. The preferred name is us-ascii. (Charmod C024)
Line 5, Column 70: Internal encoding declaration utf-8 disagrees with
the actual encoding of the document (us-ascii).
<meta http-equiv="content-type" content="text/html;charset=utf-8">
Apparently, I am not "actually" using UTF-8 even though I specified UTF-8 in my meta tag.
How do I, well, "actually" use UTF-8? What does that even mean?
The HTML5 mode of the validator treats a mismatch between encoding declarations as an error. In the message, “internal encoding declaration” refers to a meta tag such as <meta charset=utf-8>, and “actual encoding” (misleadingly) refers to the encoding declaration in the HTTP headers.
According to current HTML specifications (HTML5 is just a draft), the mismatch is not an error, and the HTTP headers win.
There is no real problem if your document only contains Ascii characters. Ascii-encoded data is trivially UTF-8 encoded too, because in UTF-8, any Ascii character is represented as a single byte, with the same value as in Ascii.
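As a rough illustration of that point (the sample strings below are made up for the example), PHP's mb_check_encoding() accepts an Ascii-only string as valid UTF-8, while ISO-8859-1 bytes outside the Ascii range are rejected:
<?php
// An Ascii-only string is valid UTF-8 as-is: same bytes, same meaning.
$ascii = "Plain Ascii text, no umlauts here.";
var_dump(mb_check_encoding($ascii, 'UTF-8'));   // bool(true)

// A byte from ISO-8859-1 outside the Ascii range is not valid UTF-8.
$latin1 = "Caf\xE9"; // "Café" encoded as ISO-8859-1
var_dump(mb_check_encoding($latin1, 'UTF-8'));  // bool(false)
?>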
It depends on the software used server-side whether and how you can change the HTTP headers. If they now specify charset=ascii, as it seems, it is not a real problem except in validation, provided that you keep using Ascii characters only. But it is somewhat odd and outdated. Try to have the encoding information there changed to charset=utf-8. You need not change the actual encoding, but if you later add non-Ascii characters, make sure you save the file as UTF-8 encoded by selecting a suitable command or option in the authoring program.
Open your file in Notepad, then choose Save As and select UTF-8 in the Encoding drop-down (next to the Save button).
On Unix-like systems you can use the iconv tool to convert a file from one encoding to another.
It can also be used from within a programming language (e.g. PHP).
The corresponding PHP function has the same name:
http://www.php.net/manual/en/function.iconv.php
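For example, a minimal PHP sketch (the file names here are placeholders) that converts a file from ISO-8859-1 to UTF-8 with iconv():
<?php
// Read the ISO-8859-1 file, convert the bytes to UTF-8, write a new file.
$latin1 = file_get_contents('page-latin1.html');
$utf8   = iconv('ISO-8859-1', 'UTF-8', $latin1);
file_put_contents('page-utf8.html', $utf8);
?>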
Specifying the encoding is one thing; saving documents in the proper encoding is another.
Edit your documents in an editor that supports UTF-8 encoding, preferably UTF-8 without a BOM. Notepad++ may be a good start.
Have a read too: UTF-8 all the way through.
I'm working with a web framework that uses a dynamic character encoding in its html templates, like this:
<meta charset="${_response_encoding}">
The problem is that when I try to edit this file in Eclipse, it thinks this is a literal encoding name and thus refuses to open the file, saying:
"Unsupported Character Encoding" Character encoding
"${_response_encoding}" is not supported by this platform.
Is there any way to tell Eclipse to stop trying to be "smart" (because it plainly isn't) and just show me the text? I've tried "Open With... Text Editor", but I get the same result.
Change the content type for HTML files:
Go to Window -> Preferences -> General -> Content Types and change the encoding (set it to UTF-8) for all the file extensions you need.
Choose "Other" and then select UTF-8. Then your template will render as normal.
I had a similar problem, except I was receiving the error message when trying to save the document after changing the character encoding. I resolved the problem by doing the following in Eclipse before putting in the non-standard charset value:
Rename the file to have a non-HTML file extension.
Open the file using an editor other than the HTML one.
Change the charset value to the non-standard value you want.
Rename the file to have the original extension.
Open the file.
Follow the buttons and prompts to set the character encoding to the real encoding of the file.
After this, the file should still be usable while still having the non-standard charset value.
If you're having Eclipse treat it like an HTML file, it is being smart. That's not a valid encoding name. Have you tried just templating the entire meta tag?
(As mentioned in a comment) In Eclipse Indigo, when opening the file you see the Unsupported character encoding message along with a Set Encoding button. Use that button to set the UTF-8 encoding. Eclipse does not change the variable in the HTML file.
True, this is done on a file-by-file basis; however, in my project I import the same meta header file for every screen. Actually, I have only two files to set up (those for users who are logged in and those who are not).
Suppose I have an input field in a web page whose charset is UTF-8, and suppose I open a text file encoded as ISO-8859-1.
Now I copy and paste a string with special characters (like, for example, ô) from the file into the input field: I see that the special characters are displayed correctly in the input field.
Who does the conversion from ISO-8859-1 to UTF-8? The browser?
When you open the file and copy/paste it to the browser, it ends up in Unicode, as that is what the browser's UI controls use internally. Who actually performs the conversion from ISO-8859-1 to Unicode depends on a few factors (what OS you are using, whether your chosen text editor is compiled to use Ansi or Unicode, what clipboard format(s) - CF_TEXT for Ansi, CF_UNICODETEXT for Unicode - the app uses for the copy, etc). But either way, when the web browser submits the form, it then encodes its Unicode data to the charset of the HTML/form during transmission.
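As a minimal sketch of that last step (the field name comment and the page itself are made up for the example): if the page is served as UTF-8, whatever was pasted into the field reaches the server as UTF-8 bytes, regardless of the encoding of the file it was copied from:
<?php
// Serve the page as UTF-8 so the browser submits the form data as UTF-8.
header('Content-Type: text/html; charset=utf-8');

if (isset($_POST['comment'])) {
    // The submitted bytes should validate as UTF-8.
    var_dump(mb_check_encoding($_POST['comment'], 'UTF-8'));
}
?>
<form method="post">
    <input type="text" name="comment">
    <button type="submit">Send</button>
</form>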
In all likelihood, it's not really converted to UTF-8, but instead to the internal representation of characters used by the browser, which is quite likely to be UTF-16 (no matter what the encoding of the web page is).