What character encoding is this in the JSON file?

I downloaded a 6GB gz file from openlibrary and extracted it on my Ubuntu machine, which turned into a 40GB txt file. When inspecting the start of the file with head, I find this string:
"name": "Mawlu\u0304d Qa\u0304sim Na\u0304yit Bulqa\u0304sim"
What encoding is this? Is it possible to get something that is human readable or does it look like it will require the data source to be exported correctly again?

It's standard escaping of Unicode characters in a JavaScript string literal.
The string is Mawlūd Qāsim Nāyit Bulqāsim.

This is plain JSON encoding. Your JSON parser will translate the \uNNNN references to Unicode characters. See also: json_encode function: special characters
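As a quick sanity check, here is a minimal Python sketch (the literal below is just the escaped value from the question wrapped in an object):

import json

line = '{"name": "Mawlu\\u0304d Qa\\u0304sim Na\\u0304yit Bulqa\\u0304sim"}'
record = json.loads(line)
print(record["name"])  # prints Mawlūd Qāsim Nāyit Bulqāsim (the \u0304 escapes become combining macrons)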

Looks like Unicode:
http://www.charbase.com/0304-unicode-combining-macron
U+0304: COMBINING MACRON

Related

Apache NiFi - All the spanish characters (ñ, á, í, ó, ú) in CSV changed to question mark (?) in JSON

I've fetched the CSV file using the GetFile processor; the CSV has Spanish characters (ñ, á, í, ó, ú and more) within the English words.
When I try to use the ConvertRecord processor with a JSONRecordSetWriter controller service, the JSON output contains question marks instead of the special characters.
What is the correct way to convert CSV records into JSON format with proper encoding?
Any response/feedback will be much appreciated.
Note: the CSV file is UTF-8 encoded and is fetched and read properly in NiFi.
If you have verified that the input is UTF-8, try this:
Open $NIFI/conf/bootstrap.conf
Add -Dfile.encoding=UTF-8 as an argument very early in the list of arguments, to force the JVM not to use the OS's settings. This has mainly been a problem in the past with the JVM on Windows.
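For reference, bootstrap.conf lists JVM arguments as numbered java.arg.N properties; a sketch of the extra line (the index 8 is only an example, use one that is not already taken in your file):

java.arg.8=-Dfile.encoding=UTF-8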

Dump Chinese data into a json file

I am running into a problem while dumping Chinese data (non-Latin-language data) into a JSON file.
I am trying to store a list in a JSON file with the following code:
with open("file_name.json","w",encoding="utf8") as file:
json.dump(edits,file)
It dumps without any errors.
But when I view the file, it looks like this:
[{sentence: \u5979\u7d30\u5c0f\u8072\u5c0d\u6211\u8aaa\uff1a\u300c\u6211\u501f\u4f60\u4e00\u679d\u925b\u7b46\u3002\u300d}...]
I also tried it without the encoding option:
with open("file_name.json","w") as file:
json.dump(edits,file)
My question is: why does my JSON file look like this, and how can I dump the file so it contains the Chinese text instead of escaped Unicode strings?
Any help would be appreciated. Thanks :)
Check out the docs for json.dump.
Specifically, it has a switch ensure_ascii that, if set to False, makes the function not escape non-ASCII characters.
If ensure_ascii is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If ensure_ascii is false, these characters will be output as-is.
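A minimal sketch of that suggestion, reusing the file name from the question (the example list just contains the escaped sentence shown above, which Python resolves to the actual Chinese characters):

import json

edits = [{"sentence": "\u5979\u7d30\u5c0f\u8072\u5c0d\u6211\u8aaa\uff1a\u300c\u6211\u501f\u4f60\u4e00\u679d\u925b\u7b46\u3002\u300d"}]

with open("file_name.json", "w", encoding="utf8") as file:
    # ensure_ascii=False writes the Chinese characters directly instead of \uXXXX escapes
    json.dump(edits, file, ensure_ascii=False)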

Issue in databricks mechanism when exporting CSV with greek characters

In Azure Databricks I have a Spark dataframe with Greek characters in some columns. When I display the dataframe the characters are presented correctly. However, when I choose to download the CSV of the dataframe from the Databricks UI, the CSV file that is created doesn't contain the Greek characters; instead, it contains strange symbols and signs. There appears to be a problem with the encoding. I also tried to create the CSV with the following Python code:
df.write.csv("FileStore/data.csv", header=True)
but the same thing happens, since there is no encoding option for pyspark. It appears that I cannot choose the encoding. Also, the dataframe is saved as one string and the rows are not separated by newlines. Is there any workaround for this problem? Thank you.
Encoding is supported by pyspark!
For example, when I read a file:
spark.read.option("delimiter", ";").option("header", "true").option("encoding", "utf-8").csv("xxx/xxx.csv")
Now you just have to choose the correct encoding for Greek characters. It's also possible that whatever console/software you use to check your input doesn't read UTF-8 by default.
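Depending on your Spark version, the writer accepts the same option, so something along these lines should produce a UTF-8 CSV (the output path is just a placeholder):

df.write.option("header", "true").option("encoding", "UTF-8").csv("/FileStore/data_utf8.csv")

If the Greek text still looks wrong, open the resulting file in an editor or console that is explicitly set to UTF-8 before assuming the data itself is broken.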

Json parsing with unicode characters

I have a JSON file with Unicode characters, and I'm having trouble parsing it. I've tried the JSON library in Flash CS5, and I have tried http://json.parser.online.fr/, and I always get "unexpected token - eval fails".
I'm sorry, there really was a problem with the syntax; it came this way from the client.
Can someone please help me? Thanks
Quoth the RFC:
JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.
So a correctly encoded Unicode character should not be a problem. Which leads me to believe that it's not correctly encoded (maybe it uses latin-1 instead of UTF-8). How did you create the file? In a text editor?
There might be an obscure Unicode whitespace character hidden in your string.
This URL contains more detail:
http://timelessrepo.com/json-isnt-a-javascript-subset
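The linked article discusses U+2028 and U+2029; if you suspect one of them is hiding in your data, a small Python check can confirm it (the file name is a placeholder):

import json

# U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR) are valid inside JSON strings
# but break eval()-style JavaScript parsers.
with open("data.json", encoding="utf-8") as f:
    text = f.read()

for ch in ("\u2028", "\u2029"):
    if ch in text:
        print("found U+%04X at offset %d" % (ord(ch), text.index(ch)))

json.loads(text)  # json.loads itself is fine with these characters; eval()-based parsers are not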
In ASP.NET you would think you would use System.Text.Encoding to convert a string like "Paul\u0027s" back to a string like "Paul's", but I tried for hours and found nothing that worked.
The trouble is that hardcoding a string as shown above already decodes the string, as you will see if you put a breakpoint on it, so in the end I wrote a loop that converts the hex escapes (e.g. \u0027, hex 27) into decimal HTML entities (e.g. &#39;, decimal 39) and then HTML-decodes the result.
string Padding = "000";
// Replace JSON-style hex escapes (e.g. \u0027) with decimal HTML entities (e.g. &#39;)
for (int f = 1; f <= 256; f++)
{
    // Build the escape sequence, zero-padded to four digits, e.g. "\u0027" for f = 27
    string Hex = "\\u" + Padding.Substring(0, 4 - f.ToString().Length) + f;
    // Re-read the same digits as hexadecimal to get the decimal code point, e.g. 0x27 -> 39
    string Dec = "&#" + Int32.Parse(f.ToString(), NumberStyles.HexNumber) + ";";
    HTML = HTML.Replace(Hex, Dec);
}
// Finally turn the HTML entities back into plain characters
HTML = System.Web.HttpUtility.HtmlDecode(HTML);
Ugly as sin, I know, but without using the latest framework (not available on the ISP's server) it was the best I could do, and someone must know a better solution.
I had the same problem and I just changed the file encoding from Mac-Roman/Windows-1252 to UTF-8, and it worked.
I had the same problem with Twitter JSON files. I was parsing them in Python with json.loads(tweet) but it failed for half of the records.
I changed to Python 3 and it works well now.
If you seem to have trouble with the encoding of a JSON file generated by Python with json.dumps() (i.e. escaped codes such as \u00fc aren't displayed correctly regardless of your editor's encoding setting): it produces ASCII output by default and escapes the Unicode characters! See also: python json unicode - how do I eval using javascript (and: python: json.dumps can't handle utf-8?, and: Why does json.dumps escape non-ascii characters with "\uxxxx").

UTF-8 Character Encoding in SQL

I am trying out Bullzip's Access to mySQL app on an Access DB full of special chars like é and ä. The app allows you to specify UTF-8 encoding but in the resulting SQL file I get "Vieux Carré" instead of "Vieux Carré".
I tried opening the SQL file in UltraEdit and doing a UTF-8 conversion but it does not resolve this issue as I guess it is converting "é" and never sees the "é"?
What is a Good™ solution for this?
The problem is in the UTF-8 to Unicode conversion into or out of Access. Access, like SQL Server, can only natively store data in ASCII or Unicode (UTF-16) format (with Unicode compression off). To ensure a given value is stored properly, you would need to convert it to Unicode on storage and convert it back to UTF-8 on retrieval. You may be able to use the StrConv function for that purpose.
I have the same problem with the Bullzip converter now, so this could still help someone.
It doesn't show the special characters right if I have my PC language set to English. I have to switch it back to Czech (the language of the special characters); then it works and the SQL looks correct.