Writing spark dataframe to ascii JSON - json

I am trying to write a Spark DataFrame as a JSON file; it will eventually be written to a MapR JSON DB table.
grp_small.toJSON.write.save("<path>")
This seems to write the file in snappy.parquet format. How do I force it to write readable JSON (text format)?

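Calling save() without an explicit format uses Spark's default data source, which is Parquet unless configured otherwise. You can instead write the DataFrame as JSON, with each row written as a readable JSON object on its own line.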
grp_small.write.json("path to output")
Hope this helps!
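Spark writes the output path as a directory of part files, each holding one JSON object per line. A minimal sketch of reading such output back with plain Python, assuming the directory is on a local filesystem and the part files end in .json (the helper name read_json_lines_dir is hypothetical):

```python
import json
from pathlib import Path


def read_json_lines_dir(output_dir):
    """Parse every non-empty line of every part-*.json file under output_dir."""
    rows = []
    for part in sorted(Path(output_dir).glob("part-*.json")):
        with part.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:  # skip blank lines
                    rows.append(json.loads(line))
    return rows
```

Marker files such as _SUCCESS do not match the pattern and are skipped.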

Related

pyspark dataframe to valid json

I'm trying to convert a DataFrame to valid JSON, but I have not succeeded yet.
If I do this:
fullDataset.repartition(1).write.json(f'{mount_point}/eds_ckan', mode='overwrite', ignoreNullFields=False)
I only get row-based JSON like this:
{"col1":"2021-10-09T12:00:00.000Z","col2":336,"col3":0.0}
{"col1":"2021-10-16T20:00:00.000Z","col2":779,"col3":6965.396}
{"col1":"2021-10-17T12:00:00.000Z","col2":350,"col3":0.0}
Does anyone know how to convert it to valid JSON that is not row-based?
Below is an example of converting a DataFrame to valid JSON.
Try using collect() and then json.dump:
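import json
# collect() returns a list of Row objects; convert each one to a plain dict
collected_df = df_final.collect()
data = [row.asDict() for row in collected_df]
with open(data_output_file + 'createjson.json', 'w') as outfile:
    json.dump(data, outfile)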
Here are a few links to related discussions with more information:
Dataframe to valid JSON
Valid JSON in spark
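If the row-based output already exists, another option is to post-process it rather than collect the DataFrame. A minimal sketch, assuming the lines fit in memory (the helper name json_lines_to_array is hypothetical; the sample rows come from the question):

```python
import json


def json_lines_to_array(lines):
    """Turn an iterable of JSON Lines strings into one JSON array string."""
    records = [json.loads(line) for line in lines if line.strip()]
    return json.dumps(records)


sample = [
    '{"col1":"2021-10-09T12:00:00.000Z","col2":336,"col3":0.0}',
    '{"col1":"2021-10-16T20:00:00.000Z","col2":779,"col3":6965.396}',
]
as_array = json_lines_to_array(sample)
```

The result is a single JSON document (an array of objects) instead of one object per line.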

JSON, CSV and Python

I have a CSV file with a column with JSON formatted data.
How can I extract the JSON data into a CSV file that can be processed in Access or SQL?
Which programming language can be used, and what would the code look like?
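One possible approach in Python, using only the standard library. This is a sketch under assumptions the question does not confirm: each JSON value is a flat object, and the column and function names (payload, flatten_json_column) are hypothetical.

```python
import csv
import io
import json


def flatten_json_column(src, dst, json_column):
    """Copy CSV rows from src to dst, expanding json_column into new columns."""
    reader = csv.DictReader(src)
    fieldnames = [c for c in reader.fieldnames if c != json_column]
    rows = []
    for row in reader:
        # An empty cell is treated as an empty object (assumption).
        payload = json.loads(row.pop(json_column) or "{}")
        for key in payload:
            if key not in fieldnames:
                fieldnames.append(key)
        row.update(payload)  # a JSON key overwrites a same-named column
        rows.append(row)
    writer = csv.DictWriter(dst, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
```

Rows whose JSON lacks a key get an empty cell for that column, so the result loads as an ordinary table in Access or SQL import tools.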

DataFrame write to CSV not supporting some characters

I am trying to parse an XML file and write the resulting DataFrame to a CSV file.
My problem is that some characters are not written correctly to the CSV. For example, the field value Nectarine tree named ‘Polar Zee’ is written as Nectarine tree named ‘Polar Zee’.
Are there any settings that need to be changed, or any properties that need to be added?
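Symptoms like this are often an encoding mismatch: the file is written as UTF-8 but opened as a legacy code page. That is an assumption here, since the question does not say how the CSV is written or viewed. A minimal sketch of the effect, and of writing with a byte-order mark (utf-8-sig), which some spreadsheet tools use to detect UTF-8:

```python
import csv
import io

text = "Nectarine tree named ‘Polar Zee’"

# Decoding UTF-8 bytes as Windows-1252 garbles the curly quotes.
garbled = text.encode("utf-8").decode("cp1252")

# Write the CSV as UTF-8 with a BOM (in-memory buffer used for illustration).
buffer = io.BytesIO()
wrapper = io.TextIOWrapper(buffer, encoding="utf-8-sig", newline="")
csv.writer(wrapper).writerow(["name", text])
wrapper.flush()
raw = buffer.getvalue()
```

With a real file, the equivalent is open(path, "w", encoding="utf-8-sig", newline="").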

how to save json data into csv using python

I have JSON data like this [image not shown].
I want to save the JSON as CSV.
The output should look like this: each title holds the information for that title.
I hope this gets converted to a comment, but look at Pandas, it can probably do what you want (Pandas json to csv)

Creating a Pandas DataFrame from a CSV file with JSON in it

I have a Postgres database where two columns are jsonb data. I used this command to get a CSV copy of the database: \copy (SELECT * FROM articles) TO articles.csv CSV DELIMITER ',' HEADER
I am using Python 3.6. When I load this CSV file into a Pandas dataframe with read_csv I get what appears to be a doubly encoded string for all the json data:
e.g. articles.iloc[0]['word_count'] gives me:
'"{\\"he\\":8,\\"is\\":8,\\"a\\":26,\\"wealthy\\":1,\\"international\\":2,\\"entrepreneur\\":1,\\"known\\":3,\\"for\\":9,\\"generous\\":1,\\"donations\\":2,\\"to\\":17,\\"his\\":6,\\"alma\\":1,\\"mater\\":1,\\"harvard\\":11,\\"now\\":2,\\"court\\":12,\\"says\\":1,\\"the\\":51,\\"university\\":3,\\"must\\":2,\\"cooperate\\":1,\\"in\\":21,\\"hunt\\":1,\\"assets\\":3,\\"federal\\":2,\\"judge\\":2,\\"boston\\":3,\\"has\\":4,\\"ruled\\":2,\\"that\\":10,\\"provide\\":1,\\"testimony\\":1,\\"and\\":11,\\"produce\\":1,\\"documents\\":3,\\"disclosing\\":1,\\"bank\\":1,\\"accounts\\":1,\\"routing\\":1,\\"numbers\\":1,\\"wire\\":1,\\"transfers\\":1,\\"other\\":2,\\"interbank\\":1,\\"messages\\":1,\\"used\\":1,\\"by\\":11,\\"an\\":6,\\"alumnus\\":1,\\"charles\\":1,\\"c\\":2,\\"spackman\\":19,\\"send\\":1,\\"money\\":2,\\"mr\\":19,\\"hong\\":5,\\"kongbased\\":1,\\"businessman\\":1,\\"leads\\":2,\\"group\\":4,\\"global\\":1,\\"investment\\":1,\\"holding\\":1,\\"company\\":10,\\"with\\":3,\\"billion\\":1,\\"under\\":1,\\"management\\":1,\\"ruling\\":4,\\"places\\":1,\\"ivy\\":1,\\"league\\":1,\\"college\\":1,\\"uncomfortable\\":1,\\"predicament\\":1,\\"of\\":19,\\"revealing\\":1,\\"confidential\\":1,\\"financial\\":1,\\"information\\":3,\\"gleaned\\":1,\\"from\\":2,\\"influential\\":1,\\"benefactor\\":1,\\"no\\":2,\\"small\\":1,\\"donor\\":1,\\"according\\":2,\\"website\\":1,\\"sponsors\\":1,\\"scholarship\\":2,\\"fund\\":1,\\"asian\\":1,\\"students\\":1,\\"at\\":2,\\"harvardasia\\":1,\\"council\\":1,\\"served\\":1,\\"as\\":2,\\"cochairman\\":1,\\"reunion\\":1,\\"gifts\\":1,\\"class\\":1,\\"year\\":1,\\"also\\":2,\\"korean\\":6,\\"name\\":1,\\"yoo\\":1,\\"shin\\":1,\\"choi\\":1,\\"obtained\\":1,\\"undergraduate\\":1,\\"degree\\":1,\\"economics\\":1,\\"spokeswoman\\":1,\\"melodie\\":1,\\"jackson\\":1,\\"said\\":8,\\"would\\":2,\\"not\\":5,\\"comment\\":2,\\"on\\":4,\\"order\\":1,\\"part\\":1,\\"longfought\\":1,\\"quest\\":1,\\"aggrieved\\":1,\\"investor\\":2,\\"sang\\":1,\\"cheol\\"
:1,\\"woo\\":3,\\"collect\\":2,\\"judgment\\":4,\\"against\\":1,\\"involving\\":1,\\"south\\":4,\\"business\\":3,\\"deal\\":1,\\"case\\":3,\\"could\\":2,\\"have\\":3,\\"furtherreaching\\":1,\\"implications\\":1,\\"douglas\\":1,\\"kellner\\":2,\\"manhattan\\":1,\\"lawyer\\":2,\\"who\\":2,\\"specializes\\":1,\\"recovering\\":1,\\"hidden\\":1,\\"worldwide\\":1,\\"if\\":2,\\"diverted\\":1,\\"funds\\":1,\\"when\\":1,\\"should\\":1,\\"been\\":3,\\"paying\\":1,\\"thats\\":1,\\"fraudulent\\":1,\\"transfer\\":1,\\"they\\":2,\\"sue\\":2,\\"get\\":2,\\"back\\":2,\\"theyd\\":1,\\"be\\":1,\\"entitled\\":1,\\"it\\":4,\\"can\\":1,\\"show\\":1,\\"was\\":6,\\"fraudulently\\":1,\\"transferred\\":1,\\"john\\":1,\\"han\\":1,\\"firm\\":2,\\"kobre\\":1,\\"kim\\":1,\\"which\\":5,\\"handling\\":1,\\"investors\\":1,\\"had\\":3,\\"plans\\":1,\\"unwittingly\\":1,\\"entangled\\":1,\\"dispute\\":1,\\"collection\\":1,\\"effort\\":1,\\"dates\\":1,\\"stock\\":2,\\"collapse\\":2,\\"littauer\\":2,\\"technologies\\":1,\\"ltd\\":1,\\"technology\\":1,\\"seoul\\":1,\\"high\\":2,\\"major\\":1,\\"fled\\":1,\\"korea\\":3,\\"amid\\":1,\\"claims\\":1,\\"price\\":1,\\"manipulation\\":1,\\"departing\\":1,\\"before\\":3,\\"authorities\\":2,\\"arrested\\":1,\\"partner\\":1,\\"later\\":3,\\"insiders\\":1,\\"profited\\":1,\\"selling\\":1,\\"their\\":1,\\"shares\\":1,\\"while\\":2,\\"minority\\":1,\\"shareholders\\":1,\\"including\\":1,\\"suffered\\":1,\\"enormous\\":1,\\"losses\\":1,\\"ordered\\":1,\\"pay\\":1,\\"million\\":2,\\"mushroomed\\":1,\\"because\\":5,\\"accumulating\\":1,\\"interest\\":1,\\"managing\\":1,\\"director\\":1,\\"richard\\":1,\\"lee\\":1,\\"related\\":1,\\"lawsuit\\":1,\\"pending\\":1,\\"kong\\":4,\\"filed\\":1,\\"appeared\\":1,\\"unaware\\":1,\\"until\\":2,\\"just\\":1,\\"overturned\\":1,\\"supreme\\":1,\\"all\\":1,\\"defendants\\":1,\\"except\\":1,\\"upheld\\":1,\\"him\\":1,\\"did\\":2,\\"appear\\":1,\\"defend\\":1,\\"himself\\":1,\\"acknowledging\\":1,\\"fined\\":1,\\"connection\\":1,\\"mat
ter\\":1,\\"maintains\\":1,\\"commit\\":1,\\"offenses\\":1,\\"woos\\":1,\\"lawyers\\":1,\\"argue\\":1,\\"efforts\\":1,\\"hampered\\":1,\\"what\\":1,\\"papers\\":2,\\"called\\":1,\\"mazelike\\":1,\\"network\\":1,\\"offshore\\":1,\\"nominees\\":1,\\"trusts\\":1,\\"many\\":1,\\"are\\":1,\\"managed\\":1,\\"close\\":1,\\"family\\":1,\\"members\\":1,\\"classmates\\":1,\\"example\\":1,\\"estate\\":1,\\"where\\":1,\\"lives\\":1,\\"section\\":1,\\"forbes\\":1,\\"described\\":1,\\"wealthiest\\":1,\\"neighborhood\\":1,\\"earth\\":1,\\"owned\\":2,\\"through\\":1,\\"series\\":1,\\"shell\\":1,\\"companies\\":1,\\"turn\\":3,\\"british\\":1,\\"virgin\\":1,\\"islands\\":1,\\"say\\":1,\\"entered\\":1,\\"feb\\":1,\\"william\\":1,\\"g\\":1,\\"young\\":1,\\"district\\":1,\\"gives\\":1,\\"march\\":1,\\"over\\":2,\\"banking\\":1,\\"orders\\":1,\\"spackmans\\":2,\\"daughter\\":1,\\"claire\\":1,\\"sophomore\\":1,\\"testify\\":1,\\"records\\":1,\\"about\\":1,\\"her\\":1,\\"fathers\\":1,\\"american\\":1,\\"citizen\\":1,\\"permanent\\":1,\\"resident\\":1,\\"well\\":1,\\"partly\\":1,\\"son\\":1,\\"james\\":1,\\"adopted\\":1,\\"americans\\":1,\\"after\\":1,\\"biological\\":1,\\"parents\\":1,\\"died\\":1,\\"during\\":1,\\"war\\":1,\\"advanced\\":1,\\"world\\":1,\\"become\\":1,\\"chief\\":1,\\"prudentials\\":1,\\"insurance\\":1,\\"holdings\\":1,\\"younger\\":1,\\"include\\":1,\\"entertainment\\":1,\\"produced\\":1,\\"science\\":1,\\"fiction\\":1,\\"movie\\":1,\\"snowpiercer\\":1,\\"starring\\":1,\\"tilda\\":1,\\"swinton\\":1,\\"octavia\\":1,\\"spencer\\":1}"'
In order to get a python dictionary from the above string I have to call json.loads(json.loads()) on it. Since I want to convert the whole column to dictionaries I tried articles['word_count'].apply( lambda x: json.loads(json.loads(x)) ) but this gives me an error:
TypeError: the JSON object must be str, bytes or bytearray, not 'float'
How do I fix this? OR am I missing a command when I export to CSV from my database? OR am I missing a command when I call read_csv in Pandas?
Note: I have tried the 'converter' option with read_csv and I get this error: JSONDecodeError: Expecting value: line 1 column 1 (char 0) My function is:
def dec(s):
    return json.loads( json.loads(s) )
Use pd.io.json.json_normalize() (exposed as pd.json_normalize() in pandas 1.0 and later) to convert an entire column of JSON data into a separate DataFrame with the same number of rows:
http://pandas.pydata.org/pandas-docs/version/0.19.0/generated/pandas.io.json.json_normalize.html
For your case it'd be something like this:
pd.io.json.json_normalize(articles.word_count)
You might have to preprocess it if Pandas doesn't understand the escaping in your input data.
Beyond all that, since your data comes from a database, you should consider just loading it directly, without the CSV intermediary. Pandas has functions for this, such as read_sql_query() and read_sql_table().
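The TypeError in the question suggests that some cells are missing: pandas represents them as float NaN, which json.loads rejects. The JSONDecodeError from the converter is consistent with empty strings being passed in. A minimal sketch of a tolerant decoder, assuming missing cells should become None (the name decode_double_json is hypothetical):

```python
import json


def decode_double_json(value):
    """Decode a doubly encoded JSON string; return None for missing values."""
    if not isinstance(value, str) or not value.strip():
        return None  # NaN (a float) or an empty cell
    decoded = json.loads(value)
    # If the first pass yields a string, the payload was encoded twice.
    if isinstance(decoded, str):
        decoded = json.loads(decoded)
    return decoded
```

It could then be applied to the column from the question as articles['word_count'].apply(decode_double_json), and the resulting dictionaries passed to json_normalize after dropping the None entries.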