Encoding ASCII characters as UTF-8 with pandas json_normalize

I am using the pandas json_normalize function to parse some data, but I'm getting ASCII-escaped data like this:
Is there any parameter to add when calling pd.json_normalize(data) in order to get the data in the correct encoding?
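There is no encoding parameter on json_normalize itself, since it operates on already-parsed Python objects. A minimal sketch of the usual fix, assuming the escapes come from the raw JSON text (the payload below is hypothetical): decode the bytes and parse the JSON first, and the \u escape sequences become real characters before you normalize.
import json
import pandas as pd

raw_bytes = b'[{"name": "Jos\\u00e9"}]'       # hypothetical payload with escapes
data = json.loads(raw_bytes.decode("utf-8"))  # parsing turns \u00e9 into "é"
df = pd.json_normalize(data)                  # the column now holds "José"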

Related

pyspark dataframe to valid json

I'm trying to convert a dataframe to a valid JSON format; however, I have not succeeded yet.
If I do it like this:
fullDataset.repartition(1).write.json(f'{mount_point}/eds_ckan', mode='overwrite', ignoreNullFields=False)
I only get row-based JSON like this:
{"col1":"2021-10-09T12:00:00.000Z","col2":336,"col3":0.0}
{"col1":"2021-10-16T20:00:00.000Z","col2":779,"col3":6965.396}
{"col1":"2021-10-17T12:00:00.000Z","col2":350,"col3":0.0}
Does anyone know how to convert it to valid JSON that is not row-based?
Below is a sample example of converting a dataframe to valid JSON.
Try using collect and then json.dump:
import json

# Collect the rows to the driver and turn them into plain dictionaries,
# since Row objects are not JSON-serializable
data = [row.asDict() for row in df_final.collect()]
with open(data_output_file + 'createjson.json', 'w') as outfile:
    json.dump(data, outfile)
Here are a few links to related discussions that you can go through for more information.
Dataframe to valid JSON
Valid JSON in spark
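An alternative sketch that stays in Spark until the last step, assuming the result fits on the driver (the output path below is hypothetical): toJSON() yields one JSON string per row, and joining those strings inside brackets produces a single valid JSON array.
rows = fullDataset.toJSON().collect()
with open('/tmp/eds_ckan.json', 'w') as f:  # hypothetical local path
    f.write('[' + ',\n'.join(rows) + ']')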

How does Pyarrow read_csv handle different file encodings?

I have a .dat file that I had been reading with pd.read_csv, and I always needed to use encoding="latin" for it to read properly / without error. When I use pyarrow.csv.read_csv I don't see a parameter to select the encoding of the file, but it still works without issue (which is great! but I don't understand why, or whether it only auto-handles certain encodings). The only parameters I'm using are delimiter="|" (with ParseOptions) and auto_dict_encode=True (with ConvertOptions).
How is pyarrow handling different encoding types?
pyarrow currently has no functionality to deal with different encodings, and assumes UTF-8 for string/text data.
But the reason it doesn't raise an error is that pyarrow will read any non-UTF-8 strings as a "binary" type column, instead of "string" type.
A small example:
# writing a small file with latin encoding
with open("test.csv", "w", encoding="latin") as f:
    f.writelines(["col1,col2\n", "u,ù"])
Reading with pyarrow gives string for the first column (which only contains ASCII characters, and is thus also valid UTF-8), but reads the second column as binary:
>>> from pyarrow import csv
>>> csv.read_csv("test.csv")
pyarrow.Table
col1: string
col2: binary
With pandas you indeed get an error by default (because pandas has no binary data type, and will try to read all text columns as Python strings, thus UTF-8):
>>> pd.read_csv("test.csv")
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf9 in position 0: invalid start byte
>>> pd.read_csv("test.csv", encoding="latin")
col1 col2
0 u ù
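If you are on an older pyarrow without encoding support, a minimal sketch of a manual workaround (assuming the bytes really are Latin-1): pull the raw bytes out of the binary column, decode them yourself, and swap a proper string column back into the table.
import pyarrow as pa
from pyarrow import csv

table = csv.read_csv("test.csv")
raw = table.column("col2").to_pylist()  # a list of bytes objects
decoded = [b.decode("latin-1") if b is not None else None for b in raw]
table = table.set_column(1, "col2", pa.array(decoded, type=pa.string()))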
It's now possible to specify encodings with pyarrow.csv.read_csv.
According to the pyarrow docs for read_csv:
The encoding can be changed using the ReadOptions class.
A minimal example follows:
from pyarrow import csv
options = csv.ReadOptions(encoding='latin1')
table = csv.read_csv('path/to/file', options)
From what I can tell, the functionality was added in this PR, so it should work starting with pyarrow 1.0.

Creating a Pandas DataFrame from a CSV file with JSON in it

I have a Postgres database where two columns are jsonb data. I used this command to get a CSV copy of the database: \copy (SELECT * FROM articles) TO articles.csv CSV DELIMITER ',' HEADER
I am using Python 3.6. When I load this CSV file into a Pandas dataframe with read_csv I get what appears to be a doubly encoded string for all the json data:
e.g. articles.iloc[0]['word_count'] gives me:
'"{\\"he\\":8,\\"is\\":8,\\"a\\":26,\\"wealthy\\":1,\\"international\\":2,\\"entrepreneur\\":1,\\"known\\":3,\\"for\\":9,\\"generous\\":1,\\"donations\\":2,\\"to\\":17,\\"his\\":6,\\"alma\\":1,\\"mater\\":1,\\"harvard\\":11,\\"now\\":2,\\"court\\":12,\\"says\\":1,\\"the\\":51,\\"university\\":3,\\"must\\":2,\\"cooperate\\":1,\\"in\\":21,\\"hunt\\":1,\\"assets\\":3,\\"federal\\":2,\\"judge\\":2,\\"boston\\":3,\\"has\\":4,\\"ruled\\":2,\\"that\\":10,\\"provide\\":1,\\"testimony\\":1,\\"and\\":11,\\"produce\\":1,\\"documents\\":3,\\"disclosing\\":1,\\"bank\\":1,\\"accounts\\":1,\\"routing\\":1,\\"numbers\\":1,\\"wire\\":1,\\"transfers\\":1,\\"other\\":2,\\"interbank\\":1,\\"messages\\":1,\\"used\\":1,\\"by\\":11,\\"an\\":6,\\"alumnus\\":1,\\"charles\\":1,\\"c\\":2,\\"spackman\\":19,\\"send\\":1,\\"money\\":2,\\"mr\\":19,\\"hong\\":5,\\"kongbased\\":1,\\"businessman\\":1,\\"leads\\":2,\\"group\\":4,\\"global\\":1,\\"investment\\":1,\\"holding\\":1,\\"company\\":10,\\"with\\":3,\\"billion\\":1,\\"under\\":1,\\"management\\":1,\\"ruling\\":4,\\"places\\":1,\\"ivy\\":1,\\"league\\":1,\\"college\\":1,\\"uncomfortable\\":1,\\"predicament\\":1,\\"of\\":19,\\"revealing\\":1,\\"confidential\\":1,\\"financial\\":1,\\"information\\":3,\\"gleaned\\":1,\\"from\\":2,\\"influential\\":1,\\"benefactor\\":1,\\"no\\":2,\\"small\\":1,\\"donor\\":1,\\"according\\":2,\\"website\\":1,\\"sponsors\\":1,\\"scholarship\\":2,\\"fund\\":1,\\"asian\\":1,\\"students\\":1,\\"at\\":2,\\"harvardasia\\":1,\\"council\\":1,\\"served\\":1,\\"as\\":2,\\"cochairman\\":1,\\"reunion\\":1,\\"gifts\\":1,\\"class\\":1,\\"year\\":1,\\"also\\":2,\\"korean\\":6,\\"name\\":1,\\"yoo\\":1,\\"shin\\":1,\\"choi\\":1,\\"obtained\\":1,\\"undergraduate\\":1,\\"degree\\":1,\\"economics\\":1,\\"spokeswoman\\":1,\\"melodie\\":1,\\"jackson\\":1,\\"said\\":8,\\"would\\":2,\\"not\\":5,\\"comment\\":2,\\"on\\":4,\\"order\\":1,\\"part\\":1,\\"longfought\\":1,\\"quest\\":1,\\"aggrieved\\":1,\\"investor\\":2,\\"sang\\":1,\\"cheol\\":1,\\"woo\\":3,\\"collect\\":2,\\"judgment\\":4,\\"against\\":1,\\"involving\\":1,\\"south\\":4,\\"business\\":3,\\"deal\\":1,\\"case\\":3,\\"could\\":2,\\"have\\":3,\\"furtherreaching\\":1,\\"implications\\":1,\\"douglas\\":1,\\"kellner\\":2,\\"manhattan\\":1,\\"lawyer\\":2,\\"who\\":2,\\"specializes\\":1,\\"recovering\\":1,\\"hidden\\":1,\\"worldwide\\":1,\\"if\\":2,\\"diverted\\":1,\\"funds\\":1,\\"when\\":1,\\"should\\":1,\\"been\\":3,\\"paying\\":1,\\"thats\\":1,\\"fraudulent\\":1,\\"transfer\\":1,\\"they\\":2,\\"sue\\":2,\\"get\\":2,\\"back\\":2,\\"theyd\\":1,\\"be\\":1,\\"entitled\\":1,\\"it\\":4,\\"can\\":1,\\"show\\":1,\\"was\\":6,\\"fraudulently\\":1,\\"transferred\\":1,\\"john\\":1,\\"han\\":1,\\"firm\\":2,\\"kobre\\":1,\\"kim\\":1,\\"which\\":5,\\"handling\\":1,\\"investors\\":1,\\"had\\":3,\\"plans\\":1,\\"unwittingly\\":1,\\"entangled\\":1,\\"dispute\\":1,\\"collection\\":1,\\"effort\\":1,\\"dates\\":1,\\"stock\\":2,\\"collapse\\":2,\\"littauer\\":2,\\"technologies\\":1,\\"ltd\\":1,\\"technology\\":1,\\"seoul\\":1,\\"high\\":2,\\"major\\":1,\\"fled\\":1,\\"korea\\":3,\\"amid\\":1,\\"claims\\":1,\\"price\\":1,\\"manipulation\\":1,\\"departing\\":1,\\"before\\":3,\\"authorities\\":2,\\"arrested\\":1,\\"partner\\":1,\\"later\\":3,\\"insiders\\":1,\\"profited\\":1,\\"selling\\":1,\\"their\\":1,\\"shares\\":1,\\"while\\":2,\\"minority\\":1,\\"shareholders\\":1,\\"including\\":1,\\"suffered\\":1,\\"enormous\\":1,\\"losses\\":1,\\"ordered\\":1,\\"pay\\":1,\\"million\\":2,\\"mushroomed\\":1,\\"because\\":5,\\"accumulating\\
":1,\\"interest\\":1,\\"managing\\":1,\\"director\\":1,\\"richard\\":1,\\"lee\\":1,\\"related\\":1,\\"lawsuit\\":1,\\"pending\\":1,\\"kong\\":4,\\"filed\\":1,\\"appeared\\":1,\\"unaware\\":1,\\"until\\":2,\\"just\\":1,\\"overturned\\":1,\\"supreme\\":1,\\"all\\":1,\\"defendants\\":1,\\"except\\":1,\\"upheld\\":1,\\"him\\":1,\\"did\\":2,\\"appear\\":1,\\"defend\\":1,\\"himself\\":1,\\"acknowledging\\":1,\\"fined\\":1,\\"connection\\":1,\\"matter\\":1,\\"maintains\\":1,\\"commit\\":1,\\"offenses\\":1,\\"woos\\":1,\\"lawyers\\":1,\\"argue\\":1,\\"efforts\\":1,\\"hampered\\":1,\\"what\\":1,\\"papers\\":2,\\"called\\":1,\\"mazelike\\":1,\\"network\\":1,\\"offshore\\":1,\\"nominees\\":1,\\"trusts\\":1,\\"many\\":1,\\"are\\":1,\\"managed\\":1,\\"close\\":1,\\"family\\":1,\\"members\\":1,\\"classmates\\":1,\\"example\\":1,\\"estate\\":1,\\"where\\":1,\\"lives\\":1,\\"section\\":1,\\"forbes\\":1,\\"described\\":1,\\"wealthiest\\":1,\\"neighborhood\\":1,\\"earth\\":1,\\"owned\\":2,\\"through\\":1,\\"series\\":1,\\"shell\\":1,\\"companies\\":1,\\"turn\\":3,\\"british\\":1,\\"virgin\\":1,\\"islands\\":1,\\"say\\":1,\\"entered\\":1,\\"feb\\":1,\\"william\\":1,\\"g\\":1,\\"young\\":1,\\"district\\":1,\\"gives\\":1,\\"march\\":1,\\"over\\":2,\\"banking\\":1,\\"orders\\":1,\\"spackmans\\":2,\\"daughter\\":1,\\"claire\\":1,\\"sophomore\\":1,\\"testify\\":1,\\"records\\":1,\\"about\\":1,\\"her\\":1,\\"fathers\\":1,\\"american\\":1,\\"citizen\\":1,\\"permanent\\":1,\\"resident\\":1,\\"well\\":1,\\"partly\\":1,\\"son\\":1,\\"james\\":1,\\"adopted\\":1,\\"americans\\":1,\\"after\\":1,\\"biological\\":1,\\"parents\\":1,\\"died\\":1,\\"during\\":1,\\"war\\":1,\\"advanced\\":1,\\"world\\":1,\\"become\\":1,\\"chief\\":1,\\"prudentials\\":1,\\"insurance\\":1,\\"holdings\\":1,\\"younger\\":1,\\"include\\":1,\\"entertainment\\":1,\\"produced\\":1,\\"science\\":1,\\"fiction\\":1,\\"movie\\":1,\\"snowpiercer\\":1,\\"starring\\":1,\\"tilda\\":1,\\"swinton\\":1,\\"octavia\\":1,\\"spencer\\":1}"'
In order to get a Python dictionary from the above string, I have to call json.loads(json.loads()) on it. Since I want to convert the whole column to dictionaries, I tried articles['word_count'].apply(lambda x: json.loads(json.loads(x))), but this gives me an error:
TypeError: the JSON object must be str, bytes or bytearray, not 'float'
How do I fix this? OR am I missing a command when I export to CSV from my database? OR am I missing a command when I call read_csv in Pandas?
Note: I have tried the converters option of read_csv and I get this error: JSONDecodeError: Expecting value: line 1 column 1 (char 0). My function is:
def dec(s):
    return json.loads(json.loads(s))
Use pd.io.json.json_normalize() to convert an entire column of JSON data into a separate DataFrame with the same number of rows:
http://pandas.pydata.org/pandas-docs/version/0.19.0/generated/pandas.io.json.json_normalize.html
For your case it'd be something like this:
pd.io.json.json_normalize(articles.word_count)
You might have to preprocess it if Pandas doesn't understand the escaping in your input data.
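About the TypeError: json.loads fails with "not 'float'" when it hits NaN, which is what pandas puts in for empty CSV cells. A minimal sketch that skips the missing rows first (assuming articles was loaded with read_csv as above):
import json
import pandas as pd

# Missing cells come back as float NaN, so drop them before double-decoding
parsed = articles['word_count'].dropna().apply(lambda x: json.loads(json.loads(x)))
word_counts = pd.io.json.json_normalize(parsed.tolist())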
Beyond all that, since your data comes from a database, you should consider just loading it directly, without the CSV intermediary. Pandas has functions for this, such as read_sql_query() and read_sql_table().
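For example, a minimal sketch with SQLAlchemy (the connection string below is hypothetical); Postgres drivers such as psycopg2 typically deserialize jsonb columns into Python dicts for you, so the double-decoding problem disappears entirely:
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; adjust to your own server and credentials
engine = create_engine("postgresql://user:password@localhost:5432/mydb")
articles = pd.read_sql_query("SELECT * FROM articles", engine)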

How to read .csv file that contains utf-8 values by pandas dataframe

I'm trying to read a .csv file that contains UTF-8 encoded values in some of its columns into a pandas DataFrame. The code is as follows:
df = pd.read_csv('Cancer_training.csv', encoding='utf-8')
Then I got the following examples of errors with different files:
(1) 'utf-8' codec can't decode byte 0xcf in position 14:invalid continuation byte
(2) 'utf-8' codec can't decode byte 0xc9 in position 3:invalid continuation byte
Could you please share your ideas and experience with this problem? Thank you.
[python: 3.4.1.final.0,
pandas: 0.14.1]
Here is a sample of the raw data; I cannot post a full record because of the legal restrictions on medical data:
I had this problem for no apparent reason; I managed to get it to work using this:
df = pd.read_csv('file', encoding="ISO-8859-1")
Not sure why, though.
I've also done as Irh09 proposed, but the second file it read was wrongly decoded and I couldn't find a column whose name contains accented characters (á, é, í, ó, ú). ISO-8859-1 can decode any byte sequence, so it never raises an error, but the result can silently be wrong when the file's real encoding is different.
So I recommend catching the error like this:
try:
    df = pd.read_csv('file', encoding="utf-8")
except UnicodeDecodeError:
    df = pd.read_csv('file', encoding="ISO-8859-1")
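Another option is to guess the encoding before reading; a sketch using the third-party chardet package (pip install chardet), keeping in mind that its detection is a heuristic, not a guarantee:
import chardet
import pandas as pd

# Sniff a chunk of raw bytes and take chardet's best guess at the codec
with open('Cancer_training.csv', 'rb') as f:
    guess = chardet.detect(f.read(100000))

df = pd.read_csv('Cancer_training.csv', encoding=guess['encoding'])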

Unescaping Characters in a JSON response string

I made a JSON request that gives me a string using Unicode escape sequences, which looks like:
s = "\u003Cp\u003E"
And I want to convert it to:
s = "<p>"
What's the best way to do this in Python?
Note: this is the same question as this one, only in Python instead of Ruby. I am also using the Posterous API.
>>> "\\u003Cp\\u003E".decode('unicode-escape')
u'<p>'
If the data came from JSON, the json module should already have decoded these escapes for you:
>>> import json
>>> json.loads('"\u003Cp\u003E"')
u'<p>'
EDIT: The original question, "Unescaping Characters in a String with Python", did not clarify whether the string was to be written or to be read (later on, the words "JSON response" were added to clarify that the intention was to read).
So I answered the opposite question: how to write JSON-serialized data, dumping it to an unescaped string (rather than loading data from the string).
My use case was producing a JSON file from my own data dictionary, but the file contained escaped non-ASCII characters. So I did it like this:
import json

with open(filename, 'w') as jsonfile:
    jsonstr = json.dumps(myDictionary, ensure_ascii=False)
    print(jsonstr)  # to screen
    jsonfile.write(jsonstr)  # to file
If ensure_ascii is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If ensure_ascii is false, these characters will be output as-is.
Taken from here: https://docs.python.org/3/library/json.html
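A quick illustration of the difference (the sample dictionary is made up):
import json

data = {"tag": "<p>", "city": "München"}
print(json.dumps(data))                      # {"tag": "<p>", "city": "M\u00fcnchen"}
print(json.dumps(data, ensure_ascii=False))  # {"tag": "<p>", "city": "München"}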