Issue in Databricks when exporting CSV with Greek characters

In Azure Databricks I have a Spark dataframe with Greek characters in some columns. When I display the dataframe the characters are presented correctly. However, when I choose to download the CSV of the dataframe from the Databricks UI, the CSV file that is created doesn't contain the Greek characters; instead, it contains strange symbols and signs. There appears to be a problem with the encoding. I also tried to create the CSV with the following Python code:
df.write.csv("FileStore/data.csv", header=True)
but the same thing happens, since there is no encoding option for PySpark. It appears that I cannot choose the encoding. Also, the dataframe is saved as one string and the rows are not separated by a newline. Is there any workaround for this problem? Thank you.

Encoding is supported by PySpark!
For example, when I read a file:
spark.read.option("delimiter", ";").option("header", "true").option("encoding", "utf-8").csv("xxx/xxx.csv")
Now you just have to choose the correct encoding for Greek characters. It's also possible that whatever console/software you use to check the file doesn't read UTF-8 by default.
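On the writing side, a minimal sketch along the same lines (assuming a Spark version whose CSV writer honors the encoding option; older releases only support it on read, and the path is the one from the question):
# Request UTF-8 (or e.g. ISO-8859-7 for Greek) explicitly when writing the CSV.
df.write.option("header", "true").option("encoding", "UTF-8").csv("FileStore/data.csv")
Note that df.write.csv produces a directory of part files at that path rather than a single data.csv, so download the part file(s) from inside it.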

Related

How do you fix the following error? I am trying to use the Table Data Import Wizard to load a csv file into Workbench

I am trying to upload a .csv file into Workbench using the Table Data Import Wizard.
I receive the following error whenever attempting to load it:
Unhandled exception: 'ascii' codec can't decode byte 0xc3 in position 1253: ordinal not in range(128)
I have tried previous solutions that suggested I encode the .csv file as an MS-DOS csv and as a UTF-8 csv. Neither has worked for me.
Attempting to change the data in the file would not be feasible since it's made up of thousands of cells, so it would be quite impractical. Is there anything that can be done to resolve this?
What was after the C3? What should have been there?
C3, when interpreted as "latin1", is Ã -- an unlikely character.
More likely is a 2-byte UTF-8 code that starts with C3. This includes the accented letters of Western European languages. Example é, hex C3A9.
You tried "UTF-8 csv" -- Please provide the specifics of how you tried it. What settings in the Wizard, etc.
Probably you should state that the data is "UTF-8" or utf8mb4, depending on whether you are referring to outside or inside MySQL.
Meanwhile, if you are loading the data into an existing "table", let's see SHOW CREATE TABLE. It should probably not say "ascii" anywhere; instead, it should probably say "utf8mb4".
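To see for yourself what actually follows that C3 byte, here is a small diagnostic sketch in Python (the filename data.csv is a placeholder for your actual file):
# Inspect the byte the Wizard complained about and the UTF-8 character it starts.
with open("data.csv", "rb") as f:
    raw = f.read()
print(hex(raw[1253]))  # expected: 0xc3
print(raw[1253:1255].decode("utf-8", errors="replace"))  # the 2-byte character, e.g. é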

Apache NiFi - All the spanish characters (ñ, á, í, ó, ú) in CSV changed to question mark (?) in JSON

I've fetched a CSV file using the GetFile processor, where the CSV has Spanish characters (ñ, á, í, ó, ú and more) within the English words.
When I try to use the ConvertRecord processor with a JSONRecordSetWriter controller service, the JSON output has question marks instead of the special characters.
What is the correct way to convert CSV records into JSON format with proper encoding?
Any response/feedback will be much appreciated.
Note: CSV File is UTF-8 encoded and fetched and read properly in NiFi.
If you have verified that the input is UTF-8, try this:
Open $NIFI/conf/bootstrap.conf
Add an argument very early in the list of arguments: -Dfile.encoding=UTF-8, to force the JVM not to use the OS's settings. This has mainly been a problem in the past with the JVM on Windows.
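For reference, the added line could look like the following, assuming the usual java.arg.<n>= convention used in bootstrap.conf (the number just has to be unused):
java.arg.57=-Dfile.encoding=UTF-8
Restart NiFi afterwards so the JVM is started with the new argument.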

Reading CSV file with Chinese Character [One character cannot be shown]

When I open a csv file containing Chinese characters using Microsoft Excel, TextWrangler or Sublime Text, there are some Chinese words which cannot be displayed properly. I have no idea why this is the case.
Specifically, the csv file can be found in the following link: https://www.hkex.com.hk/eng/plw/csv/List_of_Current_SEHK_EP.CSV
One of the words that cannot be displayed correctly is shown here:
As you can see, a ? appears in its place.
Using the Mac file command as suggested by
http://osxdaily.com/2015/08/11/determine-file-type-encoding-command-line-mac-os-x/ tells me that the csv encoding is utf-16le.
I am wondering what the problem is: why can't I read that specific text?
Is it related to encoding, or to my laptop settings? Neither Mac nor Windows 10 on the Mac (via Parallels Desktop) can display the word correctly.
Thanks for the help. I really want to know why this specific text cannot be displayed properly.
The actual name of HSBC Broking Securities is:
滙豐金融證券(香港)有限公司
The first character, U+6ED9 滙, is one of the troublesome HKSCS characters: characters that weren't available in standard pre-Unicode Big-5, which were grafted on in incompatible ways later.
For a while there was an unfortunate convention of converting these characters into Private Use Area characters when converting to Unicode. This data was presumably converted back then and is now mangled, replacing 滙 with U+E05E, a Private Use Area character.
For PUA cases that you're sure are the result of HKSCS-compatibility-bodge, you can convert back to proper Unicode using this table.
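As a minimal sketch of that conversion in Python, using only the single mapping mentioned above (U+E05E back to U+6ED9); a real repair would load the full HKSCS PUA mapping table into the dictionary:
# Translate HKSCS private-use code points back to the proper Unicode characters.
# Only the one mapping from this answer is shown; extend the dict from the full table.
pua_fixups = {0xE05E: 0x6ED9}  # U+E05E -> 滙
name = "\ue05e豐金融證券(香港)有限公司"
print(name.translate(pua_fixups))  # 滙豐金融證券(香港)有限公司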

Convert Excel sheet having shivaji fonts to csv having google unicode utf-8 encoding

I have an Excel file which uses Shivaji fonts and I want to convert it to a csv file using Google's Unicode UTF-8 encoding.
I have tried all the methods but didn't get any result: when saving the file as csv it shows ? symbols, and the characters I need come out as something like this: à¤Âकटा.
I need to import this file to mysql.
You say "Shivaji font" -- I assume that is for Devanagari characters?
I think what you have has been misconverted twice.
I simulated it by misconverting इंग्लिश to इंगà¥%C2%8Dलिश and then misconverting that to à ¤‡à ¤‚à ¤—à ¥Âà ¤²à ¤¿à ¤¶, which looks a lot like what you have.
When you copied the file around, it got converted from utf8 to latin1 twice. Figure out the steps you took in more detail. If you can get a dump at any stage, do so. The hex for the Devanagari characters looks like E0A4yy where yy is between 80 and B1.
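A minimal sketch of undoing that double mis-conversion in Python, assuming the text was mangled exactly twice via latin-1 (in practice cp1252 is sometimes involved, in which case swap the codec name):
# Reverse one utf8-read-as-latin1 round; apply twice for doubly mangled text.
def unmangle_once(s):
    return s.encode("latin-1").decode("utf-8")

original = "इंग्लिश"
mangled_twice = original.encode("utf-8").decode("latin-1").encode("utf-8").decode("latin-1")
assert unmangle_once(unmangle_once(mangled_twice)) == original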

Google search export: JSON wrong diacritics

I have downloaded the history of my Google searches from here, but the diacritics (Latin Extended characters) in the JSON files (encoded in UTF-8) are messed up.
E.g.:
dva na ôsmu
displays as
dva na �smu
and when I use a JSON indentation package in Sublime Text, I get this:
dva na \ufffdsmu
All the special characters are replaced with this same broken character. Is there any way to fix this, or is Google simply exporting broken JSON so that non-English users can't use this export? I want to build an app that will display statistics of the words used in my searches, but that is not possible with the JSON broken this way.
The JSON seems to be corrupt. I inspected the text bytes with a hex dump and the character is encoded as 0xEFBFBD, which is the UTF-8 encoding of the Unicode replacement character (U+FFFD). The letter is already lost in the JSON; the character there is just the replacement character.
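A quick way to confirm this on your own export, assuming the file is named MyActivity.json (the filename is an assumption; use whichever file your takeout contains):
# Count UTF-8 replacement-character sequences; a non-zero count means
# the original letters were already lost before the file reached you.
with open("MyActivity.json", "rb") as f:
    data = f.read()
print(data.count(b"\xef\xbf\xbd"), "replacement characters found")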