I have a Postgres database in which two columns hold jsonb data. I used this command to export the articles table to CSV: \copy (SELECT * FROM articles) TO articles.csv CSV DELIMITER ',' HEADER
I am using Python 3.6. When I load this CSV file into a pandas DataFrame with read_csv, I get what appears to be a doubly encoded string for all the JSON data.
For example, articles.iloc[0]['word_count'] gives me:
'"{\\"he\\":8,\\"is\\":8,\\"a\\":26,\\"wealthy\\":1,\\"international\\":2,\\"entrepreneur\\":1,\\"known\\":3,\\"for\\":9,\\"generous\\":1,\\"donations\\":2,\\"to\\":17,\\"his\\":6,\\"alma\\":1,\\"mater\\":1,\\"harvard\\":11,\\"now\\":2,\\"court\\":12,\\"says\\":1,\\"the\\":51,\\"university\\":3,\\"must\\":2,\\"cooperate\\":1,\\"in\\":21,\\"hunt\\":1,\\"assets\\":3,\\"federal\\":2,\\"judge\\":2,\\"boston\\":3,\\"has\\":4,\\"ruled\\":2,\\"that\\":10,\\"provide\\":1,\\"testimony\\":1,\\"and\\":11,\\"produce\\":1,\\"documents\\":3,\\"disclosing\\":1,\\"bank\\":1,\\"accounts\\":1,\\"routing\\":1,\\"numbers\\":1,\\"wire\\":1,\\"transfers\\":1,\\"other\\":2,\\"interbank\\":1,\\"messages\\":1,\\"used\\":1,\\"by\\":11,\\"an\\":6,\\"alumnus\\":1,\\"charles\\":1,\\"c\\":2,\\"spackman\\":19,\\"send\\":1,\\"money\\":2,\\"mr\\":19,\\"hong\\":5,\\"kongbased\\":1,\\"businessman\\":1,\\"leads\\":2,\\"group\\":4,\\"global\\":1,\\"investment\\":1,\\"holding\\":1,\\"company\\":10,\\"with\\":3,\\"billion\\":1,\\"under\\":1,\\"management\\":1,\\"ruling\\":4,\\"places\\":1,\\"ivy\\":1,\\"league\\":1,\\"college\\":1,\\"uncomfortable\\":1,\\"predicament\\":1,\\"of\\":19,\\"revealing\\":1,\\"confidential\\":1,\\"financial\\":1,\\"information\\":3,\\"gleaned\\":1,\\"from\\":2,\\"influential\\":1,\\"benefactor\\":1,\\"no\\":2,\\"small\\":1,\\"donor\\":1,\\"according\\":2,\\"website\\":1,\\"sponsors\\":1,\\"scholarship\\":2,\\"fund\\":1,\\"asian\\":1,\\"students\\":1,\\"at\\":2,\\"harvardasia\\":1,\\"council\\":1,\\"served\\":1,\\"as\\":2,\\"cochairman\\":1,\\"reunion\\":1,\\"gifts\\":1,\\"class\\":1,\\"year\\":1,\\"also\\":2,\\"korean\\":6,\\"name\\":1,\\"yoo\\":1,\\"shin\\":1,\\"choi\\":1,\\"obtained\\":1,\\"undergraduate\\":1,\\"degree\\":1,\\"economics\\":1,\\"spokeswoman\\":1,\\"melodie\\":1,\\"jackson\\":1,\\"said\\":8,\\"would\\":2,\\"not\\":5,\\"comment\\":2,\\"on\\":4,\\"order\\":1,\\"part\\":1,\\"longfought\\":1,\\"quest\\":1,\\"aggrieved\\":1,\\"investor\\":2,\\"sang\\":1,\\"cheol\\"
:1,\\"woo\\":3,\\"collect\\":2,\\"judgment\\":4,\\"against\\":1,\\"involving\\":1,\\"south\\":4,\\"business\\":3,\\"deal\\":1,\\"case\\":3,\\"could\\":2,\\"have\\":3,\\"furtherreaching\\":1,\\"implications\\":1,\\"douglas\\":1,\\"kellner\\":2,\\"manhattan\\":1,\\"lawyer\\":2,\\"who\\":2,\\"specializes\\":1,\\"recovering\\":1,\\"hidden\\":1,\\"worldwide\\":1,\\"if\\":2,\\"diverted\\":1,\\"funds\\":1,\\"when\\":1,\\"should\\":1,\\"been\\":3,\\"paying\\":1,\\"thats\\":1,\\"fraudulent\\":1,\\"transfer\\":1,\\"they\\":2,\\"sue\\":2,\\"get\\":2,\\"back\\":2,\\"theyd\\":1,\\"be\\":1,\\"entitled\\":1,\\"it\\":4,\\"can\\":1,\\"show\\":1,\\"was\\":6,\\"fraudulently\\":1,\\"transferred\\":1,\\"john\\":1,\\"han\\":1,\\"firm\\":2,\\"kobre\\":1,\\"kim\\":1,\\"which\\":5,\\"handling\\":1,\\"investors\\":1,\\"had\\":3,\\"plans\\":1,\\"unwittingly\\":1,\\"entangled\\":1,\\"dispute\\":1,\\"collection\\":1,\\"effort\\":1,\\"dates\\":1,\\"stock\\":2,\\"collapse\\":2,\\"littauer\\":2,\\"technologies\\":1,\\"ltd\\":1,\\"technology\\":1,\\"seoul\\":1,\\"high\\":2,\\"major\\":1,\\"fled\\":1,\\"korea\\":3,\\"amid\\":1,\\"claims\\":1,\\"price\\":1,\\"manipulation\\":1,\\"departing\\":1,\\"before\\":3,\\"authorities\\":2,\\"arrested\\":1,\\"partner\\":1,\\"later\\":3,\\"insiders\\":1,\\"profited\\":1,\\"selling\\":1,\\"their\\":1,\\"shares\\":1,\\"while\\":2,\\"minority\\":1,\\"shareholders\\":1,\\"including\\":1,\\"suffered\\":1,\\"enormous\\":1,\\"losses\\":1,\\"ordered\\":1,\\"pay\\":1,\\"million\\":2,\\"mushroomed\\":1,\\"because\\":5,\\"accumulating\\":1,\\"interest\\":1,\\"managing\\":1,\\"director\\":1,\\"richard\\":1,\\"lee\\":1,\\"related\\":1,\\"lawsuit\\":1,\\"pending\\":1,\\"kong\\":4,\\"filed\\":1,\\"appeared\\":1,\\"unaware\\":1,\\"until\\":2,\\"just\\":1,\\"overturned\\":1,\\"supreme\\":1,\\"all\\":1,\\"defendants\\":1,\\"except\\":1,\\"upheld\\":1,\\"him\\":1,\\"did\\":2,\\"appear\\":1,\\"defend\\":1,\\"himself\\":1,\\"acknowledging\\":1,\\"fined\\":1,\\"connection\\":1,\\"mat
ter\\":1,\\"maintains\\":1,\\"commit\\":1,\\"offenses\\":1,\\"woos\\":1,\\"lawyers\\":1,\\"argue\\":1,\\"efforts\\":1,\\"hampered\\":1,\\"what\\":1,\\"papers\\":2,\\"called\\":1,\\"mazelike\\":1,\\"network\\":1,\\"offshore\\":1,\\"nominees\\":1,\\"trusts\\":1,\\"many\\":1,\\"are\\":1,\\"managed\\":1,\\"close\\":1,\\"family\\":1,\\"members\\":1,\\"classmates\\":1,\\"example\\":1,\\"estate\\":1,\\"where\\":1,\\"lives\\":1,\\"section\\":1,\\"forbes\\":1,\\"described\\":1,\\"wealthiest\\":1,\\"neighborhood\\":1,\\"earth\\":1,\\"owned\\":2,\\"through\\":1,\\"series\\":1,\\"shell\\":1,\\"companies\\":1,\\"turn\\":3,\\"british\\":1,\\"virgin\\":1,\\"islands\\":1,\\"say\\":1,\\"entered\\":1,\\"feb\\":1,\\"william\\":1,\\"g\\":1,\\"young\\":1,\\"district\\":1,\\"gives\\":1,\\"march\\":1,\\"over\\":2,\\"banking\\":1,\\"orders\\":1,\\"spackmans\\":2,\\"daughter\\":1,\\"claire\\":1,\\"sophomore\\":1,\\"testify\\":1,\\"records\\":1,\\"about\\":1,\\"her\\":1,\\"fathers\\":1,\\"american\\":1,\\"citizen\\":1,\\"permanent\\":1,\\"resident\\":1,\\"well\\":1,\\"partly\\":1,\\"son\\":1,\\"james\\":1,\\"adopted\\":1,\\"americans\\":1,\\"after\\":1,\\"biological\\":1,\\"parents\\":1,\\"died\\":1,\\"during\\":1,\\"war\\":1,\\"advanced\\":1,\\"world\\":1,\\"become\\":1,\\"chief\\":1,\\"prudentials\\":1,\\"insurance\\":1,\\"holdings\\":1,\\"younger\\":1,\\"include\\":1,\\"entertainment\\":1,\\"produced\\":1,\\"science\\":1,\\"fiction\\":1,\\"movie\\":1,\\"snowpiercer\\":1,\\"starring\\":1,\\"tilda\\":1,\\"swinton\\":1,\\"octavia\\":1,\\"spencer\\":1}"'
To get a Python dictionary from the string above, I have to call json.loads(json.loads(...)) on it. Since I want to convert the whole column to dictionaries, I tried articles['word_count'].apply(lambda x: json.loads(json.loads(x))), but this gives me an error:
TypeError: the JSON object must be str, bytes or bytearray, not 'float'
How do I fix this? Am I missing an option when I export to CSV from my database, or when I call read_csv in pandas?
Note: I have tried the converters option of read_csv, and I get this error: JSONDecodeError: Expecting value: line 1 column 1 (char 0). My function is:
import json

def dec(s):
    return json.loads(json.loads(s))
Use pd.io.json.json_normalize() (exposed as pd.json_normalize() from pandas 1.0 onward) to convert an entire column of decoded JSON data into a separate DataFrame with the same number of rows:
http://pandas.pydata.org/pandas-docs/version/0.19.0/generated/pandas.io.json.json_normalize.html
For your case it'd be something like this:
pd.io.json.json_normalize(articles.word_count)
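json_normalize() works on dictionaries rather than raw strings, so you will have to preprocess the column first: the values need to be decoded (twice, given the escaping in your input data) before pandas can use them.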
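Both errors in the question most likely come from empty cells rather than from the encoding itself: read_csv turns a missing value into NaN (a float), and a converter receives it as an empty string, which json.loads cannot parse. Here is a minimal sketch of a preprocessing function that tolerates both cases. The name decode_double_json is made up for illustration, and returning None for missing values is an assumption about what you want.

```python
import json


def decode_double_json(value):
    """Decode a doubly JSON-encoded string; return None for missing values."""
    # Missing cells arrive as NaN (a float) via apply(), or as an empty
    # string via read_csv's converters. Neither is valid JSON.
    if not isinstance(value, str) or not value.strip():
        return None
    decoded = json.loads(value)
    # If the first pass still yields a string, the payload was encoded twice.
    if isinstance(decoded, str):
        decoded = json.loads(decoded)
    return decoded


# Build a doubly encoded sample shaped like the value in the question.
sample = json.dumps(json.dumps({"he": 8, "is": 8}))
print(decode_double_json(sample))        # {'he': 8, 'is': 8}
print(decode_double_json(float("nan")))  # None
```

It can be applied after loading with articles['word_count'].apply(decode_double_json), or passed to read_csv as converters={'word_count': decode_double_json}. Rows that come back as None may still need to be dropped or filled before calling json_normalize().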
Beyond all that, since your data comes from a database, you should consider just loading it directly, without the CSV intermediary. Pandas has functions for this, such as read_sql_query() and read_sql_table().