Messed up encoding using htlatex - html

I just used tex4ht and htlatex to convert a LaTeX document into HTML, and now I'm having serious trouble integrating this HTML document into a web site I'm building (with Laravel). I think one of the reasons is that the htlatex output files are not UTF-8 encoded. If I include the file through Laravel views and controllers without any modification, the UTF-8 characters are not displayed, and if I convert the file to UTF-8, all the special characters turn into garbage in Notepad and I would have to rewrite them one at a time (the HTML files contain 2000+ lines, so I can't do that).

How can I solve this? Is putting the input HTML in an iframe any good as a solution? Or is there a way to re-encode this file to UTF-8 without mangling its content? I'm lost.

tex4ht uses Latin-1 as the default output encoding; characters unsupported by this encoding are written as XML entities. You can request UTF-8 output using the following command:
htlatex filename.tex "xhtml,charset=utf-8" " -cunihtf -utf8"
As an alternative, you can use make4ht with the -u option:
make4ht -u filename.tex
make4ht is a replacement for htlatex with many more features.
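If you already have Latin-1 output from an earlier run and don't want to regenerate it, here is a minimal Python sketch (the file names are made up) of re-saving it as UTF-8, which is essentially what iconv -f ISO-8859-1 -t UTF-8 does:
# Hypothetical file names; point these at the real htlatex output.
src, dst = "filename.html", "filename-utf8.html"

with open(src, encoding="latin-1") as infile:
    text = infile.read()

# The numeric entities htlatex emitted can stay as they are;
# browsers render them the same way after the conversion.
with open(dst, "w", encoding="utf-8") as outfile:
    outfile.write(text)
Remember to also update any charset declaration in the file's <head> (e.g. a <meta> tag naming iso-8859-1) to utf-8, or browsers will still misread the converted file.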

Related

Razor not rendering special characters properly

I'm generating cshtml files dynamically for our CMS, using UTF-8 as the encoding. I also tried opening those files in Notepad++, and it says the encoding is UTF-8.
And I just use the controller's View() method to serve the page:
return View(path);
But it still renders the special characters incorrectly, like 'α' becoming 'Î±', or a right single quote becoming 'â€™'. The generated files contain the correct characters when inspected, but when they are served, the wrong characters show up.
I found the issue and the solution. The cshtml files should be written not as plain UTF-8, but as UTF-8 with a BOM. Special characters in BOM-less UTF-8 cshtml files were being mangled when served through return View(path);.
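If regenerating the files with a BOM is awkward in the CMS, here is a rough Python sketch of adding the BOM after the fact (the path is hypothetical):
import codecs
from pathlib import Path

# Hypothetical path to one generated view; loop over the output directory as needed.
path = Path("Views/Generated/page.cshtml")

data = path.read_bytes()
if not data.startswith(codecs.BOM_UTF8):
    # Prepend the three BOM bytes (EF BB BF) so the file is read as
    # UTF-8 instead of the system default code page when served.
    path.write_bytes(codecs.BOM_UTF8 + data)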

Encoding Issue in Talend Open Studio

I am working on a Talend project where we are transforming data from thousands of XML files to CSV, and we are creating the CSV files with UTF-8 encoding from Talend itself.
The issue is that some of the files are created as UTF-8 and some of them as ASCII, and I am not sure why this is happening. The files should always be created as UTF-8.
As mentioned in the comments, UTF-8 is a superset of ASCII: any ASCII character is encoded by exactly the same byte in UTF-8 as in ASCII.
A program inspecting a file that contains only ASCII characters will therefore simply report it as ASCII encoded. It is only when you include characters outside the ASCII set that whatever heuristic the reading program uses can recognise the file as UTF-8.
The only exception to this is file types that explicitly state their encoding, such as (X)HTML and XML, which typically start with an encoding declaration.
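A quick way to see the superset relationship described above (a small Python sketch, independent of the Talend job):
ascii_text = "plain ASCII only"
mixed_text = "now with a non-ASCII character: é"

# ASCII-only text encodes to exactly the same bytes under both codecs,
# so a detector has nothing to distinguish them by and reports ASCII.
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# As soon as a non-ASCII character appears, the bytes are still valid
# UTF-8 but no longer valid ASCII, and detection flips to UTF-8.
print(mixed_text.encode("utf-8"))
try:
    mixed_text.encode("ascii")
except UnicodeEncodeError as err:
    print("not representable in ASCII:", err)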
You can go to the Advanced settings tab of the tFileOutputDelimited (or other tFileOutput component) you are using and select UTF-8 as the encoding.
Here is an image of the Advanced settings tab showing where to make the selection.
I am quite sure the Unix file utility makes assumptions based on the content of the file, such as bytes falling in certain ranges or the file having a specific start (magic numbers). In your case, if you generate a perfectly valid UTF-8 file but only use the ASCII subset, the file utility will probably flag it as ASCII. In that event you are fine, as you still have a valid UTF-8 file. :)
To force Talend to produce the file the way you want, you can add an additional column to your file (for example in a tMap) and put a UTF-8 (non-ASCII) character in that column. The generated file will then be in UTF-8, as the other answers mentioned.

Display error of text file in browser

I'm having trouble with the encoding of a file, it seems. It's a text file created with Vim over SSH on a CentOS server. When viewing the file in a browser, the encoding appears to be wrong.
I created a test file that demonstrates this behavior:
res.tobscore.com/test.txt
And this is how I want the output to look (this is just an HTML file using special characters to display the umlauts correctly):
res.tobscore.com/test.html
Running the commands file and cat in the terminal produced the following output:
user>file test.txt
test.txt: UTF-8 Unicode English text
user>cat test.txt
This is a testfile. I'm using the German Umlaute and the euro sign, to test
the encoding.
Euro - €
Scharfes S - ß
Ae - Ä
Oe - Ö
Ue - Ü
As you can see, it's UTF-8 Unicode and is displayed correctly in the terminal. Do you have any suggestion why my browsers (Firefox and Chrome) have trouble displaying it? Checking it on my tablet (set up in German) with the native browser showed correct results, but trying it with Chrome showed the same garbled output.
Is there a way to set the encoding so that it displays the same in every environment?
Your server will most likely send the .txt file as Content-Type: text/plain, but with no character set. Thus, the browser has to pick something (most likely ASCII, iso-8859-1 or iso-8859-15) and will display the UTF-8 bytes as garbage.
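You can confirm this by checking the header the server actually sends for the test file (a quick Python sketch; the URL is the test file from the question, with http:// assumed):
from urllib.request import urlopen

# If the Content-Type has no charset parameter, the browser is left
# to guess the encoding of the plain-text body.
with urlopen("http://res.tobscore.com/test.txt") as response:
    print(response.headers.get("Content-Type"))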
One workaround is to wrap your text-file in a little PHP script and send the correct encoding with it:
<?php
header ('Content-Type: text/plain; charset=utf-8');
readfile ('test.txt');
?>
readfile() will dump the contents of test.txt unaltered to your browser.
Note that it is the web server that picks the Content-Type based on the extension (.txt); you can probably change that, but you'd have to dig deep into the configuration files.
Browsers have a hard time figuring out the encoding of UTF-8 text and will probably default to the system's encoding. Users would have to change the encoding manually (e.g. in Firefox, View > Character Encoding > Unicode (UTF-8)), which is not a very workable solution.
One way to fix this is to configure the web server to send the text with the right Content-Type: text/plain; charset=utf-8 header (or to do it via PHP, as suggested by JvO).
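To see that header fix in isolation, here is a minimal standalone Python server (a sketch with an arbitrary port, not the OP's actual web server configuration):
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

class Utf8TextHandler(SimpleHTTPRequestHandler):
    # Serve .txt files with an explicit charset so the browser no
    # longer has to guess; other extensions keep the default mapping.
    extensions_map = {
        **SimpleHTTPRequestHandler.extensions_map,
        ".txt": "text/plain; charset=utf-8",
    }

if __name__ == "__main__":
    # Serves the current directory on port 8000.
    ThreadingHTTPServer(("", 8000), Utf8TextHandler).serve_forever()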
Or, you could try re-encoding the text file in an encoding that is easier to detect, e.g. UTF-16 with a BOM (Byte Order Mark). In Vim, save the file via:
:setlocal bomb
:w ++enc=utf-16le
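After saving, you can verify that the file now starts with the marker browsers look for (a small Python sketch; test.txt is the file from the question):
# UTF-16 little-endian files with a BOM begin with the bytes FF FE;
# that signature is what lets browsers identify the encoding even
# when the server sends no charset information.
with open("test.txt", "rb") as f:
    print("has UTF-16 LE BOM:", f.read(2) == b"\xff\xfe")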

force eclipse to ignore character encoding attribute

I'm working with a web framework that uses a dynamic character encoding in its html templates, like this:
<meta charset="${_response_encoding}">
The problem is that when I try to edit this file in Eclipse, Eclipse treats this as a literal encoding name and refuses to open the file, saying:
"Unsupported Character Encoding" Character encoding
"${_response_encoding}" is not supported by this platform.
Is there any way to tell Eclipse to stop trying to be "smart" (because it plainly isn't) and just show me the text? I've tried "Open With... > Text Editor", but I get the same result.
Change the content type for HTML files:
Go to Window -> Preferences -> General -> Content Types and change the encoding (set it to UTF-8) for all the file extensions you need.
Choose "Other" and then select UTF-8. Then your template will render as normal.
I had a similar problem, except I was receiving the error message when trying to save the document after changing the character encoding. I resolved the problem by doing the following in Eclipse before putting in the non-standard charset value:
Rename the file to have a non-HTML file extension.
Open the file using an editor other than the HTML one.
Change the charset value to the non-standard value you want.
Rename the file to have the original extension.
Open the file.
Follow the buttons and prompts to set the character encoding to the real encoding of the file.
After this, the file should still be usable while still having the non-standard charset value.
If you're having Eclipse treat it like an HTML file, it is being smart. That's not a valid encoding name. Have you tried just templating the entire meta tag?
(As mentioned in a comment) In Eclipse Indigo, when opening the file you see the Unsupported Character Encoding message along with a Set Encoding button. Use that button to set the UTF-8 encoding. Eclipse does not change the variable in the HTML file.
True, this is done on a file-by-file basis; however, in my project I import the same meta header file for every screen, so I actually have only two files to set up (one for logged-in users and one for those who are not).

How to display non-ASCII characters from an XML output

I get this output in an XML element:
&#163;111.00
It should be £111.00.
How can I sort this out so that all Unicode characters are displayed rather than the numeric code? I am using the Linux tool wget to fetch the XML file from the Internet. Perhaps some sort of converter?
I am viewing the file in PuTTY. I am parsing the file and I want to clean the input before parsing.
I am using xml_grep2 to get the elements I want and then cat filename | while read .....
OK, I'm going to close this question now.
After parsing the file with xml_grep2 I was able to get clean output; however, I was seeing a stray à character in the file. I changed PuTTY's character-set setting from ISO-8859 to UTF-8 to resolve that.
You can use HTML::Entities to replace the entities with the literal characters. I don't know how good its coverage is, though. There are bound to be similar tools for other languages if you are not comfortable with Perl. http://metacpan.org/pod/HTML::Entities
sh$ echo '&#163;111.00' | perl -CSD -MHTML::Entities -pe 'decode_entities($_)'
£111.00
This won't work if the HTML::Entities module is not installed. If you need to install it, there are numerous tutorials about the CPAN on the Internet.
Edit: added a usage example. The -CSD option might not be necessary on your system, but on OSX at least, I got garbage output without it.
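For the record, a similar decode is available in Python's standard library, in case HTML::Entities is not an option (a sketch, not part of the original answer):
import html

# html.unescape handles both named (&pound;) and numeric (&#163;)
# entities and returns the literal characters.
print(html.unescape("&#163;111.00"))   # prints: £111.00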