Google search export: wrong diacritics in JSON

I have downloaded my Google search history from here, but the diacritics (Latin Extended characters) in the JSON files (encoded in UTF-8) are messed up.
E.g.:
dva na ôsmu
displays as
dva na �smu
and when I use a JSON indentation package in Sublime Text, I get this:
dva na \ufffdsmu
All the special characters are replaced with this same broken character. Is there any way to fix this, or is Google simply exporting broken JSON so that non-English users can't use this export? I want to build an app that will display statistics of the words used in my searches, but that is not possible with the JSON broken this way.

The JSON seems to be corrupt. I inspected the text bytes with a hex dump and the broken character is encoded as 0xEFBFBD, which is the UTF-8 encoding of U+FFFD, the Unicode replacement character. The original letter is already lost in the JSON; what the export contains is the replacement character itself, so there is nothing to decode differently on your side.
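You can confirm this yourself with a couple of lines of Python; a minimal sketch, where the file name is only illustrative of whatever your export is called:

# U+FFFD is the Unicode replacement character; its UTF-8 encoding is EF BF BD.
print("\ufffd".encode("utf-8").hex())  # efbfbd

# File name is illustrative; point this at one of the exported JSON files.
with open("MyActivity.json", encoding="utf-8") as f:
    text = f.read()
print(text.count("\ufffd"), "replacement characters found in the export")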

Related

Apache NiFi - All the Spanish characters (ñ, á, í, ó, ú) in CSV changed to question marks (?) in JSON

I've fetched a CSV file using the GetFile processor; the CSV has Spanish characters (ñ, á, í, ó, ú and more) mixed in with the English words.
When I use the ConvertRecord processor with a JSONRecordSetWriter controller service, the JSON output shows question marks instead of the special characters.
What is the correct way to convert CSV records into JSON format with proper encoding?
Any response/feedback will be much appreciated.
Note: CSV File is UTF-8 encoded and fetched and read properly in NiFi.
If you have verified that the input is UTF-8, try this:
Open $NIFI/conf/bootstrap.conf
Add -Dfile.encoding=UTF-8 as an argument very early in the list of JVM arguments to force the JVM not to use the OS's default encoding. This has mainly been a problem in the past with the JVM on Windows.
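A minimal sketch of what that looks like in bootstrap.conf; the java.arg index below is illustrative, so pick a number that isn't already used in your file:

# $NIFI/conf/bootstrap.conf
# The index "20" is illustrative; use any java.arg.N not already taken.
java.arg.20=-Dfile.encoding=UTF-8

Restart NiFi after the change so the new JVM argument takes effect.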

Issue in databricks mechanism when exporting CSV with greek characters

In Azure Databricks I have a Spark dataframe with Greek characters in some columns. When I display the dataframe the characters are presented correctly. However, when I choose to download the dataframe as a CSV from the Databricks UI, the resulting CSV file doesn't contain the Greek characters; instead, it contains strange symbols and signs. There appears to be a problem with the encoding. I also tried to create the CSV with the following Python code:
df.write.csv("FileStore/data.csv",header=True)
but the same thing happens, since there is no encoding option for pyspark. It appears that I cannot choose the encoding. Also, the dataframe is saved as one string and the rows are not separated by newlines. Is there any workaround for this problem? Thank you.
Encoding is supported by pyspark!
For example, when I read a file:
spark.read.option("delimiter", ";").option("header", "true").option("encoding", "utf-8").csv("xxx/xxx.csv")
Now you just have to choose the correct encoding for Greek characters. It's also possible that whatever console/software you use to check your input doesn't read UTF-8 by default.
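The write side takes options the same way. A minimal sketch, assuming your Spark version's CSV writer accepts the encoding option (the output path is illustrative):

# Write the dataframe from the question back out as UTF-8 CSV with a header.
df.write.option("header", "true").option("encoding", "utf-8").csv("/FileStore/data_utf8")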

Reading CSV file with Chinese Character [One character cannot be shown]

When I open a CSV file containing Chinese characters using Microsoft Excel, TextWrangler, or Sublime Text, some Chinese words cannot be displayed properly. I have no idea why this is the case.
Specifically, the csv file can be found in the following link: https://www.hkex.com.hk/eng/plw/csv/List_of_Current_SEHK_EP.CSV
One of the words that cannot be displayed correctly shows up with a ? in place of one of its characters.
Using the Mac file command, as suggested by http://osxdaily.com/2015/08/11/determine-file-type-encoding-command-line-mac-os-x/, tells me that the CSV is encoded as UTF-16LE.
I am wondering what the problem is and why I cannot read that specific text.
Is it related to encoding? Or is it related to my laptop's settings? Both macOS and Windows 10 on my Mac (via Parallels Desktop) fail to display the word correctly.
Thanks for the help. I really want to know why this specific text cannot be displayed properly.
The actual name of HSBC Broking Securities is:
滙豐金融證券(香港)有限公司
The first character, U+6ED9 滙, is one of the troublesome HKSCS characters: characters that weren't available in standard pre-Unicode Big-5, which were grafted on in incompatible ways later.
For a while there was an unfortunate convention of converting these characters into Private Use Area characters when converting to Unicode. This data was presumably converted back then and is now mangled, replacing 滙 with the Private Use Area character U+E05E.
For PUA characters that you're sure are the result of the HKSCS compatibility bodge, you can convert them back to proper Unicode using this table.
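A minimal sketch of that conversion in Python, assuming you have loaded the full PUA-to-HKSCS mapping into a dict; the single entry below is only the one discussed in this answer:

# Map HKSCS-compatibility PUA code points back to their proper Unicode
# characters. Only the entry from this answer is shown; fill in the rest
# from the compatibility table.
pua_to_unicode = {"\ue05e": "\u6ed9"}  # U+E05E -> 滙 (U+6ED9)

def fix_hkscs_pua(text):
    return "".join(pua_to_unicode.get(ch, ch) for ch in text)

print(fix_hkscs_pua("\ue05e豐金融證券(香港)有限公司"))
# 滙豐金融證券(香港)有限公司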

Convert an Excel sheet using Shivaji fonts to CSV with Unicode UTF-8 encoding

I have an Excel file which uses Shivaji fonts and I want to convert it to a CSV file with Unicode UTF-8 encoding.
I have tried all the methods, but I didn't get any result: when saving the file as CSV it shows ? symbols, and the characters I need come out looking like this: à¤Âकटा.
I need to import this file into MySQL.
You say "Shivaji font" -- I assume that is for Devanagari characters?
I think what you have has been misconverted twice.
I simulated it by misconverting इंग्लिश to इंगà¥%C2%8Dलिश and then misconverting that to à ¤‡à ¤‚à ¤—à ¥Âà ¤²à ¤¿à ¤¶, which looks a lot like what you have.
When you copied the file around, it got converted from UTF-8 to latin1 twice. Figure out the steps you took in more detail. If you can get a dump at any stage, do so. The hex of Devanagari characters looks like E0A4yy where yy is between 80 and B1.
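If that diagnosis is right, the damage can be reversed programmatically. A minimal sketch in Python, assuming the corruption really is two rounds of UTF-8 bytes being read as latin1 (or cp1252, depending on which tool did the damage) and that no bytes were dropped along the way:

def undo_double_mojibake(text, wrong_encoding="latin1"):
    # Each round of damage was: correct UTF-8 bytes decoded with the wrong
    # single-byte encoding. Reverse it by re-encoding with that encoding and
    # decoding as UTF-8, once per round of damage.
    for _ in range(2):
        text = text.encode(wrong_encoding).decode("utf-8")
    return text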

What character encoding is this in the JSON file?

I downloaded a 6 GB gz file from OpenLibrary and extracted it on my Ubuntu machine, which turned it into a 40 GB txt file. When inspecting the start of the file using head, I find this string:
"name": "Mawlu\u0304d Qa\u0304sim Na\u0304yit Bulqa\u0304sim"
What encoding is this? Is it possible to get something that is human readable or does it look like it will require the data source to be exported correctly again?
It's the standard escaping of Unicode characters in a JavaScript/JSON string literal.
The string is Mawlūd Qāsim Nāyit Bulqāsim.
This is plain JSON encoding. Your JSON parser will translate the \uNNNN references to Unicode characters. See also: json_encode function: special characters
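A minimal sketch of that in Python, using the line from the question:

import json

line = '{"name": "Mawlu\\u0304d Qa\\u0304sim Na\\u0304yit Bulqa\\u0304sim"}'
# json.loads resolves the \uNNNN escapes into the actual Unicode characters.
print(json.loads(line)["name"])  # Mawlūd Qāsim Nāyit Bulqāsim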
Looks like Unicode: \u0304 is U+0304, COMBINING MACRON.
http://www.charbase.com/0304-unicode-combining-macron