I am running into a problem while dumping Chinese (non-Latin) data to a JSON file.
I am trying to store a list in a JSON file with the following code:
with open("file_name.json", "w", encoding="utf8") as file:
    json.dump(edits, file)
It dumps without any errors. When I view the file, it looks like this:
[{sentence: \u5979\u7d30\u5c0f\u8072\u5c0d\u6211\u8aaa\uff1a\u300c\u6211\u501f\u4f60\u4e00\u679d\u925b\u7b46\u3002\u300d}...]
I also tried it without the encoding option:
with open("file_name.json", "w") as file:
    json.dump(edits, file)
My question is: why does my JSON file look like this, and how can I dump it so it contains the Chinese strings instead of Unicode escape sequences?
Any help would be appreciated. Thanks :)
Check out the docs for json.dump.
Specifically, it has a switch, ensure_ascii, which if set to False makes the function output non-ASCII characters unescaped.
If ensure_ascii is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If ensure_ascii is false, these characters will be output as-is.
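For instance, a minimal sketch of the fix (the file name and sample data below are just illustrative):

```python
import json

# Sample data standing in for the question's `edits` list
edits = [{"sentence": "我借你一枝鉛筆。"}]

with open("file_name.json", "w", encoding="utf8") as file:
    # ensure_ascii=False writes the Chinese characters as-is
    # instead of escaping them to \uXXXX sequences
    json.dump(edits, file, ensure_ascii=False)
```

Note that reading the file back with json.load gives identical data either way; the escapes only affect the on-disk representation.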
Related
I want to read data from CSV files with two possible encodings (UTF-8 and ISO-8859-15). I mean different files with different encodings, not the same file with two encodings.
Right now I can only read data correctly from a UTF-8 encoded file. Can I implement this just by adding an extra option, for example encoding: 'ISO-8859-15'?
What I have:
def csv
  file = File.open(file.tempfile)
  CSV.open(file, csv_options)
end

private

def csv_options
  {
    col_sep: ";",
    headers: true,
    return_headers: false,
    skip_blanks: true
  }
end
Once you know which encoding your file has, you can pass it inside the CSV options, e.g.
external_encoding: Encoding::ISO_8859_15,
internal_encoding: Encoding::UTF_8
(This would establish that the file is ISO-8859-15, but you want the strings internally as UTF-8.)
So the strategy is to decide first (before opening the file) which encoding the file has, and then use the appropriate option Hash.
In Azure Databricks I have a Spark dataframe with Greek characters in some columns. When I display the dataframe the characters are presented correctly. However, when I choose to download the CSV with the dataframe from the Databricks UI, the CSV file that is created doesn't contain the Greek characters; instead, it contains strange symbols and signs. There appears to be a problem with the encoding. I also tried to create the CSV with the following Python code:
df.write.csv("FileStore/data.csv",header=True)
but the same thing happens, since there seems to be no encoding option for pyspark. It appears that I cannot choose the encoding. Also, the dataframe is saved as one string and the rows are not separated by newlines. Is there any workaround for this problem? Thank you.
Encoding is supported by pyspark!
For example when I read a file :
spark.read.option("delimiter", ";").option("header", "true").option("encoding", "utf-8").csv("xxx/xxx.csv")
Now you just have to choose the correct encoding for Greek characters. It's also possible that whatever console/software you use to check your input doesn't read UTF-8 by default.
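As an aside, here is a small pure-Python illustration (with a sample Greek word of my choosing) of why viewing UTF-8 bytes under a wrong default codec produces strange symbols:

```python
text = "Ελλάδα"                    # sample Greek text
utf8_bytes = text.encode("utf-8")  # how the CSV stores it on disk

# A viewer that assumes Latin-1 turns the UTF-8 byte pairs into mojibake:
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)

# Decoding with the correct codec restores the Greek letters:
assert utf8_bytes.decode("utf-8") == text
```

So before re-exporting anything, it is worth checking the viewer's encoding setting first.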
I wrote a function in my web application, based on eXist-db, to export some XML elements to CSV with XQuery. Everything works fine, but some of my elements contain umlauts like ü, ä or ß, which are displayed the wrong way in my CSV. I tried to encode the content using fn:normalize-unicode, but this is not working.
Here is a minimized example of my code snippet:
let $input :=
    <root>
        <number>1234</number>
        <name>Aufmaß</name>
    </root>
let $csv := string-join(
    for $ta in $input
    return concat($ta/number/text(), fn:normalize-unicode($ta/name/text())))
let $csv-ueber-string := concat($csv-ueber, string-join($massnahmen, $nl))
let $set-content-type := response:set-header('Content-Type', 'text/csv')
let $set-accept := response:set-header('Accept', 'text/csv')
let $set-file-name := response:set-header('Content-Disposition', 'attachment; filename="export.csv"')
return response:stream($csv, '')
It's very unlikely indeed that there's anything wrong with your query, or that there's anything you can do in your query to correct this.
The problem is likely to be either
(a) the input data being passed to your query is in a different character encoding from what the query processor thinks it is
(b) the output data from your query is in a different character encoding from what the recipient of the output thinks it is.
A quick glance at your query suggests that it doesn't actually have any external input other than the query source code itself. But the source code is one of the inputs, and that's a possible source of error. A good way to eliminate this possibility might be to see what happens if you replace
<name>Aufmaß</name>
by
<name>Aufma{codepoints-to-string(223)}</name>
If that solves the problem, then your query source text is not in the encoding that the query compiler thinks it is.
The other possibility is that the problem is on the output side, and frankly, this seems more likely. You seem to be producing an HTTP response stream as output, and constructing the HTTP headers yourself. I don't see any evidence that you are setting any particular encoding in the HTTP response headers. The response:stream() function is vendor-specific and I'm not familiar with its details, but I suspect that you need to ensure it encodes the content in UTF-8 and that the HTTP headers say it is in UTF-8; this may be by extra parameters to the function, or by external configuration options.
As you might expect, eXist is serializing the CSV as Unicode (UTF-8). But when you open the resulting export.csv file directly in Excel (i.e., via File > Open), Excel will try its best to guess the encoding of the CSV file. CSV files lack any way of declaring their encoding, so applications may well guess wrong, as it sounds like Excel did in your case. On my computer, Excel guesses wrong too, mangling Aufmaß as AufmaÃŸ. Here's the way to force Excel to use the encoding of a UTF-8 encoded CSV file such as the one produced by your query.
In Excel, start a new spreadsheet via File > New
Select File > Import to bring up a series of dialogs that let you specify how to import the CSV file.
In the first dialog, select "CSV file" as the type of file.
In the next dialog, titled "Text Import Wizard - Step 1 of 3", select "Unicode (UTF-8)" as the "File origin." (At least these are the titles/order in my copy of MS Excel for Mac 2016.)
Proceed through the remainder of the dialogs, keeping the default values.
Excel will then place the contents of your export.csv in the new spreadsheet.
Lastly, let me provide the following query I used to test and confirm that the CSV file produced by eXist does open as expected when following the directions above. The query is essentially the same as yours but fixes some problems in your query that prevented me from running it directly. I saved this query at /db/csv-test.xq and called it via http://localhost:8080/exist/rest/db/csv-test.xq,
xquery version "3.1";

let $input :=
    <root>
        <number>1234</number>
        <name>Aufmaß</name>
    </root>
let $cell-separator := ","
let $column-headings := $input/*/name()
let $header-row := string-join($column-headings, $cell-separator)
let $body-row := string-join($input/*/string(), $cell-separator)
let $newline := '
'
let $csv := string-join(($header-row, $body-row), $newline)
return
    response:stream-binary(
        util:string-to-binary($csv),
        "text/csv",
        "export.csv"
    )
I downloaded a 6 GB gz file from Open Library and extracted it on my Ubuntu machine, where it turned into a 40 GB txt file. When inspecting the start of the file using head, I found this string:
"name": "Mawlu\u0304d Qa\u0304sim Na\u0304yit Bulqa\u0304sim"
What encoding is this? Is it possible to get something that is human-readable, or does it look like the data source will need to be exported correctly again?
It's standard escaping of Unicode characters in a JavaScript literal string.
The string is Mawlūd Qāsim Nāyit Bulqāsim.
This is plain JSON encoding. Your JSON parser will translate the \uNNNN references to Unicode characters. See also: json_encode function: special characters
It looks like Unicode: U+0304 is COMBINING MACRON (see http://www.charbase.com/0304-unicode-combining-macron).
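Putting the answers above together, a short standard-library Python sketch shows the escapes resolving to combining macrons, which NFC normalization can compose into precomposed letters like ū:

```python
import json
import unicodedata

raw = '"Mawlu\\u0304d Qa\\u0304sim Na\\u0304yit Bulqa\\u0304sim"'
name = json.loads(raw)  # \u0304 escapes become COMBINING MACRON characters
print(name)

# NFC composes 'u' + U+0304 into the single code point U+016B (ū)
composed = unicodedata.normalize("NFC", name)
assert "\u016b" in composed
```

So the data is fine; it just needs a JSON parser rather than a re-export.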
I made a JSON request that gives me a string that uses Unicode character codes that looks like:
s = "\u003Cp\u003E"
And I want to convert it to:
s = "<p>"
What's the best way to do this in Python?
Note, this is the same question as this one, only in Python instead of Ruby. I am also using the Posterous API.
>>> "\\u003Cp\\u003E".decode('unicode-escape')
u'<p>'
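The session above is Python 2 (hence the u'<p>' result); assuming Python 3, the equivalent via the codecs module would be:

```python
import codecs

s = "\\u003Cp\\u003E"  # text containing literal backslash escapes
decoded = codecs.decode(s, "unicode_escape")
print(decoded)  # <p>
```

One caveat: the unicode_escape codec treats non-escape bytes as Latin-1, so for actual JSON data json.loads is the safer choice.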
If the data came from JSON, the json module should already have decoded these escapes for you:
>>> import json
>>> json.loads('"\u003Cp\u003E"')
u'<p>'
EDIT: The original question, "Unescaping Characters in a String with Python", did not clarify whether the string was to be written or to be read (the words "JSON response" were added later, to clarify that the intention was to read).
So I answered the opposite question: how to write JSON-serialized data by dumping it to an unescaped string (rather than loading data from the string).
My use case was producing a JSON file from my own data dictionary, but the file contained escaped non-ASCII characters. So I did it like this:
with open(filename, 'w') as jsonfile:
    jsonstr = json.dumps(myDictionary, ensure_ascii=False)
    print(jsonstr)           # to screen
    jsonfile.write(jsonstr)  # to file
If ensure_ascii is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If ensure_ascii is false, these characters will be output as-is.
Taken from here: https://docs.python.org/3/library/json.html