Saving Pandas DataFrame and meta-data to JSON format

I have a need to save a Pandas DataFrame, along with some metadata to a file in JSON format. (The JSON format is a requirement.)
Background
A) I can successfully read/write my rather large Pandas DataFrame from/to JSON using DataFrame.to_json() and pandas.read_json(). No problems.
B) I have no problems saving my metadata (dict) to JSON using json.dump()/json.load()
My first attempt
Since Pandas does not support DataFrame metadata directly, my first thought was to
top_level_dict = {}
top_level_dict['data'] = df.to_dict()
top_level_dict['metadata'] = {'some':'stuff'}
json.dump(top_level_dict, fp)
Failure modes
C) I have found that even the simplified case of
df_dict = df.to_dict()
json.dump(df_dict, fp)
fails with:
TypeError: key (u'US', 112, 5, 80, 'wl') is not a string
D) Investigating, I've found that the complement also fails.
df.to_json(fp)
json.load(fp)
fails with
384 raise ValueError("No JSON object could be decoded")
ValueError: Expecting : delimiter: line 1 column 17 (char 16)
So it appears that Pandas JSON format and the Python's JSON library are not compatible.
My first thought is to chase down a way to modify the df.to_dict() output of C to make it amenable to Python's JSON library, but I keep hearing "If you're struggling to do something in Python, you're probably doing it wrong." in my head.
Question
What is the canonical/recommended method for adding metadata to a Pandas DataFrame and storing it in a JSON-formatted file?
Python 2.7.10
Pandas 0.17
Edit 1:
While trying out Evan Wright's great answer, I found the source of my problems: Pandas (as of 0.17) does not like saving Multi-Indexed DataFrames to JSON. The library I had created to save my (Multi-Indexed) DataFrames is quietly performing a df.reset_index() before calling DataFrame.to_json(). My newer code was not. So it was DataFrame.to_json() burping on the MultiIndex.
Lesson: Read the documentation kids, even when it's your own documentation.
Edit 2:
If you need to store both the DataFrame and the metadata in a single JSON object, see my answer below.

You should be able to just put the data on separate lines.
Writing:
f = open('test.json', 'w')
df.to_json(f)
print >> f  # Python 2: writes a newline between the two JSON documents
json.dump(metadata, f)
f.close()
Reading:
f = open('test.json')
df = pd.read_json(next(f))
metadata = json.loads(next(f))
f.close()
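For readers on Python 3, the same two-line layout can be sketched as follows. This is only an illustration under assumptions: the small DataFrame, the metadata dict, and the temporary path are made up for the example, and the DataFrame has a plain single-level index.

```python
import io
import json
import os
import tempfile

import pandas as pd

# Hypothetical example data; any singly-indexed DataFrame should behave the same.
df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})
metadata = {"some": "stuff"}

path = os.path.join(tempfile.mkdtemp(), "test.json")

# Line 1 holds the DataFrame, line 2 holds the metadata.
with open(path, "w") as f:
    f.write(df.to_json())
    f.write("\n")
    json.dump(metadata, f)

with open(path) as f:
    # Wrap the line in StringIO so read_json treats it as JSON text, not a path.
    df_back = pd.read_json(io.StringIO(next(f)))
    metadata_back = json.loads(next(f))
```

Note that the resulting file is two JSON documents separated by a newline, so it cannot be parsed with a single json.load() call.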

In my question, I erroneously stated that I needed the JSON in a file. In that situation, Evan Wright's answer is my preferred solution.
In my case, I actually need to store the JSON output as a single "blob" in a database, so my dictionary-wrangling approach appears to be necessary.
If you similarly need to store the data and metadata in a single JSON blob, the following code will work:
top_level_dict = {}
top_level_dict['data'] = df.to_dict()
top_level_dict['metadata'] = {'some':'stuff'}
with open(FILENAME, 'w') as outfile:
    json.dump(top_level_dict, outfile)
Just make sure DataFrame is singly-indexed. If it's Multi-Indexed, reset the index (i.e. df.reset_index()) before doing the above.
Reading the data back in:
with open(FILENAME, 'r') as infile:
    top_level_dict = json.load(infile)
df_as_dict = top_level_dict.pop('data', {})
# DataFrame.from_dict() rebuilds the frame from the dict written by to_dict()
df = pandas.DataFrame.from_dict(df_as_dict)
meta = top_level_dict['metadata']
At this point, you'll need to re-create your Multi-Index (if applicable)
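The whole round trip, including resetting and restoring a Multi-Index, might look like the sketch below. The helper names to_blob and from_blob, the extra "index_names" key, and the sample data are assumptions made for this illustration, not part of the original answer. It also assumes every index level has a name, and it routes the data through DataFrame.to_json() so that pandas, rather than the json module, deals with NumPy types.

```python
import json

import pandas as pd


def to_blob(df, metadata):
    """Pack a DataFrame and a metadata dict into one JSON string (hypothetical helper)."""
    index_names = list(df.index.names)  # assumes every index level is named
    flat = df.reset_index()             # Multi-Index levels become ordinary columns
    return json.dumps({
        "metadata": metadata,
        "index_names": index_names,
        # Let pandas serialize the values, then embed the parsed result.
        "data": json.loads(flat.to_json(orient="records")),
    })


def from_blob(blob):
    """Inverse of to_blob: returns (DataFrame, metadata)."""
    top = json.loads(blob)
    df = pd.DataFrame(top["data"]).set_index(top["index_names"])
    return df, top["metadata"]


# Hypothetical Multi-Indexed example data.
index = pd.MultiIndex.from_tuples([("US", 112), ("US", 113)], names=["country", "code"])
df = pd.DataFrame({"wl": [5.0, 80.0]}, index=index)

blob = to_blob(df, {"some": "stuff"})
restored, meta = from_blob(blob)
```

Storing the index names next to the data means the reader does not have to know in advance which columns made up the Multi-Index.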

Related

I'm getting UnicodeDecodeError when trying to load a JSON file into a dataframe

So, I'm using the following code to get pandas to read my JSON text file-
f = open('C:/Users/stans/WFH Project/data.json')
data = json.load(f)
df = pd.DataFrame(data, index=[0])
f.close()
Once I execute the cell, I get
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 1535: character maps to <undefined>
I used the above coding for a smaller sample of JSON data and it worked. But, since I updated the file to include a much larger sample, I get that error.
I verified that the JSON format is correct, and I also tried in the open statement-
encoding='utf-8'
and
errors='ignore'
Both produced value errors. Any ideas? Thanks in advance for your help!

How can I write to JSON file, without deleting all the content in it?

import json
f = open("filename.json", "w")
data = {"username": "justausername"}
json.dump(data, f)
When I run this code, all the data in the "filename.json" is replaced by "{'username': 'justausername'}". Please help!
Read the file
Parse the JSON to a data structure
Modify the data structure instead of creating a new one
Serialise the data structure back to JSON
Write it to the file
Consider using a real database instead so that you get benefits like automatic protection for concurrent edits. SQLite is a good choice if you want a single file to store the data in.
import json
with open("filename.json", "r") as f: # reading a file
    data = json.load(f) # deserialization
data["username"] = "justausername" # modifying the python object
with open("filename.json", "w") as f:
    json.dump(data, f) # serializing back to the original file

How open and read JSON file?

I have a JSON file that is 186 MB in size, and I am trying to read it with Python.
import json
f = open('file.json','r')
r = json.loads(f.read())
ValueError: Extra data: line 88 column 2 -...
How to open it? Help me
Your JSON file isn't a JSON file, it's several JSON files mashed together.
The first instance of this occurs in the 1630070th character:
'шова"}]}]}{"response":[{"count'
           ^ here
That said, jq appears to be able to handle it, so the individual parts are fine.
You'll need to split the file at the boundaries of the individual JSON objects. Try catching the JSONDecodeError and using its .pos (the character offset where parsing stopped; .colno is only the column within the current line) to slice the text into correct chunks.
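An alternative to slicing by hand is json.JSONDecoder.raw_decode(), which parses one document and reports where it ended. The sketch below is a swapped-in technique rather than the exception-based approach described above; the function name iter_json_documents and the sample string are made up for illustration.

```python
import json


def iter_json_documents(text):
    """Yield each JSON document from a string of concatenated documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not accept leading whitespace, so skip it first.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # Returns the parsed object and the index just past its end.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical sample: two objects mashed together, as in the question.
sample = '{"response": [{"count": 1}]}{"response": [{"count": 2}]}\n'
docs = list(iter_json_documents(sample))
```

For a 186 MB file this still reads the whole text into memory first, which is the same cost as the f.read() call in the question.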
It should be (json.load() takes the file object, whereas json.loads() expects a string):
r = json.load(f)

Dealing with commas within a field in a csv file using pyspark

I have a csv data file containing commas within a column value. For example,
value_1,value_2,value_3
AAA_A,BBB,B,CCC_C
Here, the values are "AAA_A","BBB,B","CCC_C". But, when trying to split the line by comma, it is giving me 4 values, i.e. "AAA_A","BBB","B","CCC_C".
How to get the right values after splitting the line by commas in PySpark?
Use the spark-csv package from Databricks.
Delimiters between quotes, by default ("), are ignored.
Example:
val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("cars.csv")
For more info, review https://github.com/databricks/spark-csv
If your quote character is (') instead of ("), you can configure that with this package.
EDIT:
For python API:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('cars.csv')
Best regards.
If you do not mind the extra package dependency, you could use Pandas to parse the CSV file. It handles internal commas just fine.
Dependencies:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
Read the whole file at once into a Spark DataFrame:
sc = SparkContext('local','example') # if using locally
sql_sc = SQLContext(sc)
pandas_df = pd.read_csv('file.csv') # assuming the file contains a header
# If no header:
# pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2'])
s_df = sql_sc.createDataFrame(pandas_df)
Or, even more data-consciously, you can chunk the data into a Spark RDD then DF:
chunk_100k = pd.read_csv('file.csv', chunksize=100000)
for chunky in chunk_100k:
    Spark_temp_rdd = sc.parallelize(chunky.values.tolist())
    try:
        Spark_full_rdd += Spark_temp_rdd
    except NameError:
        Spark_full_rdd = Spark_temp_rdd
    del Spark_temp_rdd
Spark_DF = Spark_full_rdd.toDF(['column 1','column 2'])
I'm (really) new to Pyspark, but have been using Pandas for the past years. What I'm going to put here might not be ultimately the best solution, but it works for me so I think it's worth posting here.
I'm encountering the same issue loading in a CSV file with extra comma embedded in one special field, which triggered an error if using Pyspark, but had no problem if using Pandas. So I looked around for a solution to deal with this extra delimiter, and the following piece of code solved my issue:
df = sqlContext.read.format('csv').option('header','true').option('maxColumns','3').option('escape','"').load('cars.csv')
I personally like to force the 'maxColumns' parameter to allow only a specific number of columns. So if the "BBB,B" somehow got parsed into two strings, spark is going to give an error message and print the whole line for you. And the 'escape' option is the one that really fixed my issue. I don't know if this helps, but hopefully that's something to run experiments with.
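All of the answers above rely on the embedded comma sitting inside a quoted field; an unquoted line such as AAA_A,BBB,B,CCC_C is ambiguous for any parser. The sketch below uses Python's standard csv module, not Spark, to show the difference between a quote-aware parser and a plain split. It assumes the file quotes the field as "BBB,B", which the question's sample does not show.

```python
import csv
import io

# Assumed variant of the question's data, with the comma-bearing field quoted.
raw = 'value_1,value_2,value_3\nAAA_A,"BBB,B",CCC_C\n'

# A quote-aware parser keeps the quoted field intact.
rows = list(csv.reader(io.StringIO(raw)))
header, first = rows[0], rows[1]

# A plain split on commas breaks the quoted field apart.
naive = raw.splitlines()[1].split(",")
```

Here first has three values and naive has four, which is the symptom described in the question.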

How to read .csv file that contains utf-8 values by pandas dataframe

I'm trying to read .csv file that contains utf-8 data in some of its columns. The method of reading is by using pandas dataframe. The code is as following:
df = pd.read_csv('Cancer_training.csv', encoding='utf-8')
Then I got the following examples of errors with different files:
(1) 'utf-8' codec can't decode byte 0xcf in position 14:invalid continuation byte
(2) 'utf-8' codec can't decode byte 0xc9 in position 3:invalid continuation byte
Could you please share your ideas and experience with such problem? Thank you.
[python: 3.4.1.final.0,
pandas: 0.14.1]
sample of the raw data, I cannot put full record because of the legal restrictions of the medical data:
I had this problem for no apparent reason, I managed to get it work using this:
df = pd.read_csv('file', encoding = "ISO-8859-1")
not sure why though
I also did what Irh09 proposed, but the second file I read was decoded incorrectly, and I couldn't find a column whose name contains accented characters (á, é, í, ó, ú).
So I recommend wrapping the read in a fallback like this:
try:
    df = pd.read_csv('file', encoding = "utf-8")
except UnicodeDecodeError:
    df = pd.read_csv('file', encoding= "ISO-8859-1")