I have an Excel file in the Bengali language. To display the Bengali text properly I need Bengali fonts installed on the PC.
I converted the Excel file into CSV using Office 2010, but it only shows '?' marks instead of the Bengali characters. Then I tried Google Docs for the conversion, with the same problem, except with unreadable characters rather than '?'s. I pasted extracts from that file into an HTML file and tried to view it in my browser, unsuccessfully.
What should I do to get a CSV file from an .xlsx file in Bengali so that I can import that into a MySQL database?
Edit: The accepted answer in this SO question is what led me to try Google Docs.
According to the answers to the question Excel to CSV with UTF8 encoding, Google Docs should save CSV properly, contrary to Excel, which destroys all characters that are not representable in the “ANSI” encoding being used. But maybe they have changed this, or something else is wrong, or the analysis of the situation is incorrect.
For properly encoded Bangla (Bengali) processed in MS Office programs, there should be no need for any “Bangla fonts”, since the Arial Unicode MS font (shipped with Office) contains the Bangla characters. So is the data actually in some nonstandard encoding that relies on a specially encoded font? In that case, it should first be converted to Unicode, though possibly it can be somehow managed using programs that consistently use that specific font.
In Excel, when using Save As, you can select “Unicode text (*.txt)”. It saves the data as TSV (tab-separated values) in UTF-16 encoding. You may then need to convert it to use comma as separator instead of tab, and/or from UTF-16 to UTF-8. But this only works if the original data is properly encoded.
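A minimal Python sketch of that route, assuming the original data really is proper Unicode (file names are placeholders): read Excel's UTF-16 tab-separated output and rewrite it as UTF-8 CSV, letting the csv module handle any quoting that a blind tab-to-comma replacement would get wrong.

```python
import csv

def utf16_tsv_to_utf8_csv(src, dst):
    """Convert Excel's "Unicode text" output (UTF-16 TSV) to UTF-8 CSV."""
    # The "utf-16" codec consumes the BOM that Excel writes at the start.
    with open(src, encoding="utf-16", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        # csv handles separator change and quoting of embedded commas.
        csv.writer(fout).writerows(csv.reader(fin, delimiter="\t"))
```

Note the `newline=""` arguments: the csv module manages line endings itself and its documentation requires files to be opened this way.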
Related
I regularly export CSV files from Shopware and edit them in Excel (Windows 10 + Office 2016). The special symbols appear garbled (e.g. "â€“" where an en dash should be), but I can correct that with a "find-and-replace" macro. Annoying but workable.
However, I just got a new laptop also with Windows 10 + Office 2016 but there, the special symbols appear as white question marks on black diamonds (��). When I open the same files on the old PC I still get the good old garbled (but fixable) special symbols.
I have checked every setting I can think of but cannot find any difference between the 2 PCs. Does anyone have an idea what could be causing this and how to fix it?
Thanks!
The "garbled characters" in the old laptop are UTF-8-encoded file data decoded as (probably) Windows-1252 encoding. It seems like the new laptop is using a different default encoding.
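The effect is easy to reproduce in Python: decoding UTF-8 bytes with Windows-1252 yields exactly that kind of garbage.

```python
# The same UTF-8 bytes, decoded with the wrong (Windows-1252) code page:
for ch in "–é":
    print(ch, "->", ch.encode("utf-8").decode("cp1252"))
# – -> â€“
# é -> Ã©
```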
If you export your CSV files as UTF-8 w/ BOM, Excel will display them properly without any "find-and-replace". If Shopware doesn't have the option to export as UTF-8 w/ BOM, you can use an editor like Notepad++ to load the UTF-8-encoded CSV and re-save it as UTF-8 w/ BOM.
The UTF-16 encoding should also work if that is an option for export.
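For reference, the re-save step amounts to prepending the three-byte UTF-8 BOM (EF BB BF), which is what tells Excel to pick UTF-8. A Python sketch of the same operation (file names are placeholders), using the `utf-8-sig` codec, which writes the BOM automatically:

```python
def add_utf8_bom(src, dst):
    """Re-save a UTF-8 file as UTF-8 w/ BOM, as Notepad++ would."""
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8-sig", newline="") as fout:
        fout.write(fin.read())
```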
The culprit was an optional beta setting under Control panel / Clock and Region / Administrative / Change System locale => Beta: Use Unicode UTF-8 for worldwide language support. Once I unchecked the box, the �� disappeared and everything was back to normal.
The next part of the solution is to open the CSV files with a text editor, e.g. Notepad, and save them with UTF-8 w/ BOM encoding. After doing that, the special characters appear correctly in Excel, eliminating the need for "find and replace".
Big thanks to Mark Tolonen + Skomisa for pointing me in the right direction.
Working with data in Guinea there are administrative boundaries with special characters, specifically:
Guéckédou
The CSV was apparently created/edited on a Mac, and Notepad++ detects it as ANSI; it displays correctly on my Windows machine. It works fine in Excel too, but in NetLogo, for example when printing in the Command Center, it comes out as:
Gu�ck�dou
Creating a CSV in Notepad++ with UTF-8 enforced works with the csv extension.
Not sure if it's related, but a somewhat similar problem seems to occur when exporting the world to CSV and opening it in Excel: it gives the following for a perfectly fine string that was added as an attribute to patches using the gis extension:
Guéckédou
Is there a way to consume that CSV other than converting it to UTF-8?
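If conversion does turn out to be the only practical option, it can at least be scripted as a one-time step. A sketch, assuming the file really is Windows-1252 (what Notepad++ labels "ANSI" on a Western-locale Windows machine); the source encoding and file names are assumptions to adjust:

```python
def ansi_to_utf8(src, dst):
    """Re-encode a Windows-1252 ("ANSI") CSV as UTF-8 for NetLogo."""
    # Assumption: cp1252 is the actual source encoding; change it if
    # Notepad++ reports something else.
    with open(src, encoding="cp1252", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(fin.read())
```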
I have Chinese characters hard-coded in PeopleCode, which are written to a CSV file. The CSV file is attached to an email notification. However, when the user receives the email and opens the CSV attachment, the Chinese characters show up as strange symbols. By the way, I am running this from an Application Engine program that uses PSUNX.
Anyone have any workaround about this?
The problem appears to be that you are not writing the file in the same character set your recipient is opening it with. Since you are using UTF-8, your encoding does support the Chinese characters.
I see you have a couple options:
Find out the character set your recipient is using and use that character set when writing the file.
Educate the recipient that the file is in UTF8 and that they may need to open it differently. Here is a link on how to open a CSV using UTF8 in Excel.
Alright, I managed to solve it using UTF8BOM.
I am working on a Talend project where we are transforming data from thousands of XML files to CSV, and we create the CSV files with UTF-8 encoding from Talend itself.
But some of the files are created as UTF-8 and some as ASCII, and I am not sure why this is happening. The files should always be created as UTF-8.
As mentioned in the comments, UTF-8 is a superset of ASCII: any ASCII character is encoded by exactly the same byte in UTF-8 as in ASCII.
A program examining a file that contains only ASCII characters will therefore simply report it as ASCII. Only when the file includes characters outside the ASCII set can it be recognised as UTF-8 by whatever heuristic the reading program uses.
The only exception to this is for file types that specifically state their encoding. This includes things like (X)HTML and XML which typically start with an encoding declaration.
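A minimal Python demonstration of why detection behaves this way:

```python
# ASCII-only text encodes to byte-for-byte identical output under both
# encodings, so a detector has nothing to distinguish them.
assert "id,name".encode("utf-8") == "id,name".encode("ascii")

# One non-ASCII character introduces a multi-byte sequence that marks
# the file as UTF-8.
assert "café".encode("utf-8") == b"caf\xc3\xa9"
```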
You can go to the Advanced tab of the tFileOutputDelimited (or other kind of tFileOutxxx) you are using and select UTF-8 encoding.
(The original answer included a screenshot of the Advanced tab showing where to make this selection.)
I am quite sure the Unix file utility makes assumptions based on the content of the file being in some range and/or having a specific start (magic numbers). In your case, if you generate a perfectly valid UTF-8 file but use only the ASCII subset, the file utility will probably flag it as ASCII. In that event you are fine, as you still have a valid UTF-8 file. :)
To force Talend to produce a file that is detected as UTF-8, you can add an extra column to your file (for example in a tMap) and put a non-ASCII character in that column. The generated file will then be identified as UTF-8, as the other answers mentioned.
My table needs to support pretty much all characters (Japanese, Danish, Russian, etc.)
However, when I save the 2-column table as CSV from Excel with UTF-8 encoding and then import it with phpMyAdmin with UTF-8 encoding selected, a lot of the original characters go missing (the ones with special marks such as umlauts, accents, etc.). Also, anything following a problematic character is removed entirely. I haven't the slightest idea what is causing this problem.
EDIT: For those who come upon the same issue, I'd suggest opening your CSV file in Notepad++ and going to "Encoding > Convert to UTF-8" (not "Encode in UTF-8") first, then importing it. It will surely work.
I found an answer here:
https://help.salesforce.com/apex/HTViewSolution?id=000003837
Basically:
1. Save as a Unicode text file from Excel.
2. Replace all tabs with commas in a code-friendly text editor.
3. Re-save as UTF-8.
4. Change the file extension from .txt to .csv.
Exporting directly from Excel to .csv causes problems with Japanese; this is why I went searching for help...