How to parse rows of a small DataFrame as JSON strings?

I have a DataFrame df that is the result of some pre-processing. The size of df is around 10,000 rows.
I save this DataFrame in CSV as follows:
df.coalesce(1).write.option("sep",";").option("header","true").csv("output/path")
Now I want to save this DataFrame as a text file in which each row is a JSON string, so the column names should become the attribute names in the JSON strings.
For example:
df =
col1 col2 col3
aa 34 55
bb 13 77
json_txt =
{"col1": "aa", "col2": "34", "col3": "55"}
{"col1": "bb", "col2": "13", "col3": "77"}
Which is the best way to do it?

You can use the write.json API to save a DataFrame in JSON format:
df.coalesce(1).write.json("output path of json file")
The code above creates a JSON file. If you want a text file (of JSON strings) instead, you can use the toJSON API:
df.toJSON.rdd.coalesce(1).saveAsTextFile("output path to text file")
I hope the answer is helpful.
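For intuition, here is what the JSON-lines output of either approach looks like, sketched with only the standard library; the rows below are a stand-in for the example DataFrame, not Spark output:

```python
import json

# stand-in rows mirroring the example DataFrame (column -> value)
rows = [
    {"col1": "aa", "col2": "34", "col3": "55"},
    {"col1": "bb", "col2": "13", "col3": "77"},
]

# one JSON object per line -- the format produced by both
# write.json and toJSON ... saveAsTextFile
json_txt = "\n".join(json.dumps(row) for row in rows)
print(json_txt)
```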

Related

How to loop through json and create a dataframe

I have a JSON file like the one below; how can I make a DataFrame out of it? I want to make the main keys the index and the subkeys the columns.
{
"PACK": {
"labor": "Recycle",
"actual": 0,
"Planned": 2,
"max": 6
},
"SORT": {
"labor": "Mix",
"actual": 10,
"Planned": 4,
"max": 3
}
}
The expected output is something like the table below. I tried to use df.T, but it does not work. Any help on this is appreciated.
actual planned
PACK 0 2
SORT 10 4
You can read your JSON file into a dict, then create a DataFrame with the dict values as data and the dict keys as the index.
import json
import pandas as pd

with open('test.json') as f:
    data = json.load(f)

df = pd.DataFrame(data.values(), index=data.keys())
print(df)
labor actual Planned max
PACK Recycle 0 2 6
SORT Mix 10 4 3
Then select the columns (note that the column name is capitalized as "Planned" in the source JSON, so select it that way and rename it if you want a lowercase header):
df = df[['actual', 'Planned']]
Pandas can read JSON files in many formats. For your use case, the following option should read your data the way you want:
pd.read_json(json_file, orient="index")
More information about the orient option can be found at the official documentation.
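As a self-contained sketch of that option, with the question's data inlined as a string (wrapped in StringIO, since recent pandas versions expect a file path or file-like object):

```python
import io

import pandas as pd

raw = """{
    "PACK": {"labor": "Recycle", "actual": 0, "Planned": 2, "max": 6},
    "SORT": {"labor": "Mix", "actual": 10, "Planned": 4, "max": 3}
}"""

# orient="index" treats each top-level key as one row label
df = pd.read_json(io.StringIO(raw), orient="index")
print(df[["actual", "Planned"]])
```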

How to read with pandas a set of arrays containing one JSON object each?

I am trying to read from a text file into a pandas DataFrame. The text file seems to contain a 2D array of JSON objects; how can I read it?
[[{'metric_name':'CPU','category':'A','data':'9','time_stamp':'2019-03-28 13:15:31'}],[{'metric_name':'Disk','category':'B','data':'56','time_stamp':'2019-03-28 13:15:31'}]]
I expect to have the parameters "metric_name", "category", "data", "time_stamp" as headers
Here is a solution:
import json
import pandas as pd

# load the file (json.load expects valid JSON, i.e. double-quoted strings)
with open('myfile.json') as f:
    raw_data = json.load(f)

# raw_data contains a nested list, so convert it to a simple list
data = [x[0] for x in raw_data]

# then create the dataframe
df = pd.DataFrame.from_records(data)
Here is the content of data. The nested list has been converted to a simple list (assuming that we have one record per inner array):
[{"category": "A",
  "data": "9",
  "metric_name": "CPU",
  "time_stamp": "2019-03-28 13:15:31"},
 {"category": "B",
  "data": "56",
  "metric_name": "Disk",
  "time_stamp": "2019-03-28 13:15:31"}]
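Note that the input shown in the question uses single quotes, which json.load will reject. If the file really looks like that, one option is ast.literal_eval, which parses Python-style literals; a sketch under that assumption:

```python
import ast

import pandas as pd

# the file content from the question, single quotes and all
text = ("[[{'metric_name':'CPU','category':'A','data':'9',"
        "'time_stamp':'2019-03-28 13:15:31'}],"
        "[{'metric_name':'Disk','category':'B','data':'56',"
        "'time_stamp':'2019-03-28 13:15:31'}]]")

raw_data = ast.literal_eval(text)   # parses the Python-style literal
records = [x[0] for x in raw_data]  # flatten the nested list
df = pd.DataFrame.from_records(records)
print(df)
```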

How to convert JSON data to a .tsv file using Python?

My JSON data looks like this:
data ={
"time": "2018-10-02T10:19:48+00:00",
"class": "NOTIFICATION",
"type": "Access Control",
"event": "Window/Door",
"number": -61
}
Desired output have to be like this:
time class type event number
2018-10-02T10:19:48+00:00 NOTIFICATION Access Control Window/Door -61
Could anyone help me out? Thanks in advance.
I think it's the same as converting JSON to CSV, except you use a tab instead of a comma as the delimiter, as follows:
import json
import csv

# read the input; json.load already returns a Python object,
# so no second json.loads call is needed
with open("data.json", "r") as json_file:
    data = json.load(json_file)

with open("data.tsv", "w", newline="") as tsv_file:
    tsv_writer = csv.writer(tsv_file, delimiter='\t')
    tsv_writer.writerow(data[0].keys())  # write the header
    for row in data:  # write data rows
        tsv_writer.writerow(row.values())
The above code works if your JSON file contains a list of data rows. Since your example is a single object, the two lines below are enough:
tsv_writer.writerow(data.keys())    # write the header
tsv_writer.writerow(data.values())  # write the values
Hope this helps.
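A minimal, self-contained version of the single-object case, writing to an in-memory buffer instead of a file so the result is easy to inspect:

```python
import csv
import io

data = {
    "time": "2018-10-02T10:19:48+00:00",
    "class": "NOTIFICATION",
    "type": "Access Control",
    "event": "Window/Door",
    "number": -61,
}

buf = io.StringIO()
tsv_writer = csv.writer(buf, delimiter="\t")
tsv_writer.writerow(data.keys())    # header row
tsv_writer.writerow(data.values())  # value row
tsv = buf.getvalue()
print(tsv)
```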

How to create DataFrame based on multiple JSON files

I have many JSON files inside a folder. All of them have the same structure. Now I want to create the DataFrame, and each JSON file should be the row of this DataFrame.
I know how to create a DataFrame from a single JSON string, but I don't know how to deal with multiple ones:
import spark.implicits._
val jsonStr = """{ "key": 111, "value": 54, "stamp": "aaa"}"""
val df = spark.read.json(Seq(jsonStr).toDS)
Assuming you have your JSON files in the folder src/main/resources, the following code will produce the desired result:
private val df: DataFrame = spark.read.json("src/main/resources")
df.show()
+---+-----+-----+
|key|stamp|value|
+---+-----+-----+
|111| aaa| 54|
|111| aaa| 54|
+---+-----+-----+
Note that the JSON files should be machine-readable rather than human-readable: by default, Spark expects each JSON object on a single line (i.e. no newline characters inside an object). Pretty-printed, multi-line files require the multiLine option.
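If Spark is not a hard requirement, the same one-row-per-file idea can be sketched in plain Python with pandas; the two sample files below are created on the fly purely for illustration:

```python
import glob
import json
import os
import tempfile

import pandas as pd

# create two sample JSON files standing in for the folder contents
folder = tempfile.mkdtemp()
samples = [
    {"key": 111, "value": 54, "stamp": "aaa"},
    {"key": 222, "value": 99, "stamp": "bbb"},
]
for i, record in enumerate(samples):
    with open(os.path.join(folder, f"part{i}.json"), "w") as f:
        json.dump(record, f)

# read every file in the folder: one DataFrame row per file
records = []
for path in sorted(glob.glob(os.path.join(folder, "*.json"))):
    with open(path) as f:
        records.append(json.load(f))

df = pd.DataFrame.from_records(records)
print(df)
```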

How to clean up my JSON response before writing to file in python/django?

This is where the function converts the dictionary to JSON for an HttpResponse:
HttpResponse(
    json.dumps(cm_dict),
    content_type='application/javascript; charset=utf8'
)
I understand that since the content type is set, the JSON response is displayed nicely. But what I want to do is write the JSON to a file in a similar format.
What I get is this:
"{\"a\" : \"b\", \"c\" : \"d\"}"
which is written to the file using the code below:
with open('data.json', 'w') as outfile:
    json.dump(json_data, outfile, sort_keys=True, indent=4, ensure_ascii=False)
What I want is:
{
"a": "b",
"c": "d"
}
It looks like the contents of json_data is already a JSON-encoded string. Calling json.dump on it serializes the string a second time, which is why the quotes come out escaped. Either write the string to the file directly, or parse it with json.loads first and then re-dump it with indent=4.
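A short sketch of the parse-then-re-dump route; the input string here is a stand-in for the question's json_data:

```python
import json

json_data = '{"a" : "b", "c" : "d"}'  # already a JSON-encoded string

obj = json.loads(json_data)  # parse it back into a dict first
pretty = json.dumps(obj, sort_keys=True, indent=4, ensure_ascii=False)
print(pretty)
```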