I am working on a Talend project where we are transforming data from thousands of XML files to CSV, and we create the CSV files with UTF-8 encoding from Talend itself.
The issue is that some of the files are created as UTF-8 and some as ASCII. I am not sure why this is happening; the files should always be created as UTF-8.
As mentioned in the comments, UTF-8 is a superset of ASCII: every ASCII character is encoded with the same single byte in UTF-8 as in ASCII.
Any program examining a file that contains only ASCII characters will therefore simply report it as ASCII encoded. It is only when the file includes characters outside the ASCII range that whatever heuristic the reading program uses can recognise it as UTF-8.
The only exception to this is for file types that specifically state their encoding. This includes things like (X)HTML and XML which typically start with an encoding declaration.
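The superset relationship is easy to verify; here is a minimal Python sketch:

```python
# ASCII-only text encodes to identical bytes under ASCII and UTF-8,
# so a file containing only ASCII characters gives an encoding
# detector nothing to distinguish.
ascii_text = "Just plain ASCII, code points 0-127 only."
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# One character outside ASCII changes the picture: UTF-8 uses a
# multi-byte sequence that cannot be valid ASCII.
text = "naïve"
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)  # b'na\xc3\xafve' -- the 0xC3 0xAF pair marks UTF-8
```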
You can go to the Advanced tab of the tFileOutputDelimited (or other tFileOutputXXX) component you are using and select UTF-8 encoding.
Here is an image of the Advanced tab showing where to make the selection:
I am quite sure the Unix `file` utility makes its guess based on the content of the file falling into certain byte ranges and/or having a specific start (magic numbers). If you generate a perfectly valid UTF-8 file but use only the ASCII subset, `file` will probably flag it as ASCII. In that case you are fine, as you still have a valid UTF-8 file. :)
To force Talend to produce the file as you wish, you can add an extra column to your output (for example in a tMap) and put a non-ASCII UTF-8 character in it. The generated file will then be detected as UTF-8, as the other answers mentioned.
I'm generating .cshtml files dynamically for our CMS, encoded as UTF-8. Opening those files in Notepad++ also shows the encoding as UTF-8.
And I just use the controller's View() method to serve the page:
return View(path);
But it still renders the special characters incorrectly: 'α' becomes 'Î±', and a curly single quote becomes 'â€™'. On inspection, the generated files contain the correct characters, but the served page shows incorrect ones.
I found the issue and the solution: the .cshtml files should be written not in plain UTF-8 but in UTF-8 with a BOM. Special characters in BOM-less UTF-8 .cshtml files were being mangled when served through return View(path);.
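For reference, writing a file with a UTF-8 BOM is a one-liner in most languages; a Python sketch (the file name is an example, and the `utf-8-sig` codec prepends the EF BB BF byte order mark):

```python
# Write the file with a leading UTF-8 BOM so that consumers which
# sniff for the mark decode the content as UTF-8.
content = "<p>α and ’curly quotes’</p>"
with open("page.cshtml", "w", encoding="utf-8-sig") as f:
    f.write(content)

with open("page.cshtml", "rb") as f:
    assert f.read(3) == b"\xef\xbb\xbf"  # BOM is present
```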
Using the Moodle user import from CSV, we have the problem that some German names with letters like Ö, ä, ü are imported incorrectly. I presume the problem is in the encoding; here are the two possibilities I tested:
ANSI-encoding: The German letters disappear, for example Michael Dürr appears like Michael Drr in the listed users to import.
UTF-8-encoding: The letters appear as Michael Drürr
Does anyone have a solution for this problem, or does it have to be fixed one by one in the user list?
I'm guessing the original file is using a different encoding. Try converting the CSV file to UTF-8, then import.
How do I correct the character encoding of a file?
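A minimal re-encoding sketch in Python; the file names are examples, and the source encoding is an assumption (Windows-1252 is a common "ANSI" code page on German systems — verify what your export actually uses):

```python
# Demo input standing in for the exported CSV: a file saved in
# Windows-1252 ("ANSI").
with open("users.csv", "wb") as f:
    f.write("username,firstname,lastname\nmduerr,Michael,Dürr\n".encode("cp1252"))

# The fix: read with the source encoding, rewrite as UTF-8, then
# import the UTF-8 file into Moodle.
with open("users.csv", "r", encoding="cp1252") as src:
    data = src.read()
with open("users-utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(data)
```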
you have to configure the database connection to make sure the encoding you choose for your webapplication (moodle) is the same as the encoding your database connection will choose.
look for SET NAMES 'utf8' or similar if you use mariadb/mysql as database.
and compare, of course, to the encoding of your import file. maybe you will need to convert it first. in any case the encoding of your web gui, the file, and the database connection (client character set) should be the same.
for web application check in your browser via View->Encoding or something similar, or check the meta header tag for the encoding in your html source code.
for file, use some editor or the like that will display the chars correctly and will indicate the charset.
for database, depends on your database.
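The two failure modes described in the question are exactly what you get when the file's encoding and the character set used to read it disagree; a Python illustration:

```python
# A Latin-1 ("ANSI") file read as UTF-8: 0xFC (ü) is not a valid
# UTF-8 sequence, so the character is dropped or replaced.
name = "Dürr"
print(name.encode("latin-1").decode("utf-8", errors="ignore"))  # Drr

# A UTF-8 file read as Latin-1: the two-byte sequence for ü turns
# into two spurious characters (classic double encoding).
print(name.encode("utf-8").decode("latin-1"))  # DÃ¼rr
```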
I have Chinese characters hard-coded in PeopleCode, written out to a CSV file. This CSV file is attached to an email notification. However, when the user receives the email and opens the CSV attachment, the Chinese characters show up as strange symbols or characters. I am running this from an Application Engine program, which uses PSUNX.
Does anyone have a workaround for this?
The problem appears to be that the file is not written in the same character set your recipient opens it with. Since you are using UTF-8, your chosen encoding does support the Chinese characters.
I see you have a couple options:
Find out the character set your recipient is using and use that character set when writing the file.
Educate the recipient that the file is in UTF8 and that they may need to open it differently. Here is a link on how to open a CSV using UTF8 in Excel.
Alright, I managed to solve it using UTF-8 with a BOM.
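The same fix can be sketched outside PeopleCode; in Python, writing the CSV with the `utf-8-sig` codec prepends the BOM that lets Excel auto-detect UTF-8 instead of falling back to the local ANSI code page (the file name and data are examples):

```python
import csv

# "utf-8-sig" prepends EF BB BF; Excel uses that mark to decode the
# attachment as UTF-8, so the Chinese text survives.
with open("notification.csv", "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["姓名", "金额"])
    writer.writerow(["张伟", "1000"])
```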
My table needs to support pretty much all characters (Japanese, Danish, Russian, etc.)
However, when I save the two-column table as CSV from Excel with UTF-8 encoding and then import it with phpMyAdmin with UTF-8 encoding selected, many of the original characters go missing (the ones with special properties such as umlauts and accents). Also, anything following a problematic character is removed entirely. I haven't the slightest idea what is causing this.
EDIT: For those who come upon the same issue, I'd suggest opening your CSV file in Notepad++ and using "Encoding > Convert to UTF-8" (not "Encode in UTF-8") first, then importing it. It will surely work.
I found an answer here:
https://help.salesforce.com/apex/HTViewSolution?id=000003837
Basically: save the file as Unicode text from Excel,
then replace all tabs with commas in a code-friendly text editor,
re-save as UTF-8,
and change the extension from .txt to .csv.
Exporting directly from Excel to .csv causes problems with Japanese, which is why I went searching for help...
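The manual steps above can be scripted; here is a Python sketch (file names are examples) that converts Excel's "Unicode Text" export, which is tab-separated UTF-16, straight to UTF-8 CSV. Using the csv module instead of a blind tab-to-comma replacement also correctly quotes any fields that themselves contain commas:

```python
import csv

# Demo input standing in for Excel's "Unicode Text" export
# (real exports start with a BOM, which the utf-16 codec handles).
with open("export.txt", "w", encoding="utf-16", newline="") as f:
    f.write("id\tname\n1\t日本語\n")

# Convert: UTF-16 TSV in, UTF-8 CSV out.
with open("export.txt", "r", encoding="utf-16", newline="") as src, \
     open("export.csv", "w", encoding="utf-8", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src, delimiter="\t"):
        writer.writerow(row)
```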
I have an Excel file in the Bengali language. To display the Bengali text properly I need Bengali fonts installed on the PC.
I converted the Excel file into CSV using Office 2010, but it only shows '?' marks instead of the Bengali characters. I then used Google Docs for the conversion, with the same problem, except with unreadable characters rather than '?'s. I pasted extracts from that file into an HTML file and tried to view it in my browser, unsuccessfully.
What should I do to get a CSV file from an .xlsx file in Bengali so that I can import that into a MySQL database?
Edit: The answer accepted in this SO question made me go to Google Docs.
According to the answers to the question Excel to CSV with UTF8 encoding, Google Docs should save CSV properly, contrary to Excel, which destroys all characters that are not representable in the "ANSI" encoding being used. But maybe they changed this, or something else is wrong, or the analysis of the situation is incorrect.
For properly encoded Bangla (Bengali) processed in MS Office programs, there should be no need for any “Bangla fonts”, since the Arial Unicode MS font (shipped with Office) contains the Bangla characters. So is the data actually in some nonstandard encoding that relies on a specially encoded font? In that case, it should first be converted to Unicode, though possibly it can be somehow managed using programs that consistently use that specific font.
In Excel, when using Save As, you can select “Unicode text (*.txt)”. It saves the data as TSV (tab-separated values) in UTF-16 encoding. You may then need to convert it to use comma as separator instead of tab, and/or from UTF-16 to UTF-8. But this only works if the original data is properly encoded.