How to convert hierarchical DataFrame back from json? [duplicate] - json
I'm trying to read back a DataFrame that was written with df.to_json() via pd.read_json, but I'm getting a ValueError. I think it may have to do with the index being a MultiIndex, but I'm not sure how to deal with that.
The original dataframe of 55k rows is called psi and I created test.json via:
psi.head().to_json('test.json')
Here is the output of print psi.head().to_string() if you want to use that.
When I do it on this small set of data (5 rows), I get a ValueError.
! wget --no-check-certificate https://gist.githubusercontent.com/olgabot/9897953/raw/c270d8cf1b736676783cc1372b4f8106810a14c5/test.json
import pandas as pd
pd.read_json('test.json')
Here's the full stack:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-1de2f0e65268> in <module>()
1 get_ipython().system(u' wget https://gist.githubusercontent.com/olgabot/9897953/raw/c270d8cf1b736676783cc1372b4f8106810a14c5/test.json')
2 import pandas as pd
----> 3 pd.read_json('test.json')
/home/obot/virtualenvs/envy/lib/python2.7/site-packages/pandas/io/json.pyc in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit)
196 obj = FrameParser(json, orient, dtype, convert_axes, convert_dates,
197 keep_default_dates, numpy, precise_float,
--> 198 date_unit).parse()
199
200 if typ == 'series' or obj is None:
/home/obot/virtualenvs/envy/lib/python2.7/site-packages/pandas/io/json.pyc in parse(self)
264
265 else:
--> 266 self._parse_no_numpy()
267
268 if self.obj is None:
/home/obot/virtualenvs/envy/lib/python2.7/site-packages/pandas/io/json.pyc in _parse_no_numpy(self)
481 if orient == "columns":
482 self.obj = DataFrame(
--> 483 loads(json, precise_float=self.precise_float), dtype=None)
484 elif orient == "split":
485 decoded = dict((str(k), v)
ValueError: No ':' found when decoding object value
> /home/obot/virtualenvs/envy/lib/python2.7/site-packages/pandas/io/json.py(483)_parse_no_numpy()
482 self.obj = DataFrame(
--> 483 loads(json, precise_float=self.precise_float), dtype=None)
484 elif orient == "split":
But when I do it on the whole dataframe (55k rows), I get an invalid pointer error and the IPython kernel dies. Any ideas?
EDIT: added how the json was generated in the first place.
This is not implemented ATM; see the issue here: https://github.com/pydata/pandas/issues/4889.
You can simply reset the index first, e.g.:
df.reset_index().to_json(...)
and it will work.
Or you can just write the JSON with orient='table':
df.to_json(path_or_buf='test.json', orient='table')
Then read the MultiIndex JSON back:
pd.read_json('test.json', orient='table')
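For illustration, a minimal round-trip sketch with a made-up two-level index (the index names and values here are assumptions, not from the question; orient='table' needs a reasonably recent pandas):
import pandas as pd

# a small DataFrame with a two-level index, standing in for psi;
# orient='table' embeds a Table Schema, so the MultiIndex survives
idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)],
                                names=['letter', 'number'])
df = pd.DataFrame({'x': [10, 20, 30]}, index=idx)

df.to_json('test.json', orient='table')
restored = pd.read_json('test.json', orient='table')
assert restored.index.equals(df.index)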
If you want to restore the MultiIndex structure:
# save the MultiIndex level names
indexes_names = df.index.names
df.reset_index().to_json('dump.json')
# restore the MultiIndex structure:
loaded_df = pd.read_json('dump.json').set_index(indexes_names)
This was my simple, dirty fix for encoding/decoding a MultiIndex pandas DataFrame, which also seems to work for datetimes in the index/columns... not optimized!
Here is the encoder to JSON - I encode the DataFrame, index, and columns into a dict to create a single JSON string:
import json
import pandas as pd

def to_json_multiindex(df):
    # serialize the values, the index levels, and the column levels separately
    dfi = df.index.to_frame()
    dfc = df.columns.to_frame()
    d = dict(
        df=df.to_json(),
        di=dfi.to_json(),
        dc=dfc.to_json(),
    )
    return json.dumps(d)
And here is the decoder, which reads the JSON dict and re-creates the DataFrame:
def read_json_multiindex(j):
    d = json.loads(j)
    # rebuild the index: a multi-column frame becomes a MultiIndex,
    # a single column becomes a plain named Index
    di = pd.read_json(d['di'])
    if di.shape[1] > 1:
        di = pd.MultiIndex.from_frame(di)
    else:
        _name = di.columns[0]
        di = di.index
        di.name = _name
    # rebuild the columns the same way
    dc = pd.read_json(d['dc'])
    if dc.shape[1] > 1:
        dc = pd.MultiIndex.from_frame(dc)
    else:
        _name = dc.columns[0]
        dc = dc.index
        dc.name = _name
    # reassemble the frame from raw values plus the rebuilt axes
    df = pd.read_json(d['df']).values
    return pd.DataFrame(
        data=df,
        index=di,
        columns=dc,
    )
And here is a test with MultiIndex columns and index... it seems to preserve the DataFrame. A couple of issues: 1) it is probably inefficient, and 2) it doesn't seem to work for datetimes in a MultiIndex (though it works when the datetime isn't in a MultiIndex):
df = pd.DataFrame(
    data=[[0, 1, 2], [2, 3, 4], [5, 6, 7]],
    index=pd.MultiIndex.from_tuples(
        (('aa', 'bb'), ('aa', 'cc'), ('bb', 'cc')),
        names=['AA', 'BB'],
    ),
    columns=pd.MultiIndex.from_tuples(
        (('XX', 'YY'), ('XX', 'ZZ'), ('YY', 'ZZ')),
        names=['YY', 'ZZ'],
    ),
)
j = to_json_multiindex(df)
d = read_json_multiindex(j)
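As a quick sanity check (my addition, not part of the original test), pandas' testing helper can verify the round trip; check_dtype is relaxed because passing the data through .values can upcast dtypes:
# compare the original df with the decoded d, ignoring dtype widening
pd.testing.assert_frame_equal(df, d, check_dtype=False)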
Related
reading json to pandas DataFrame but with thousands of rows to pandas append
I have a text file where each line has been cleaned up to be in JSON format. I can read each line, clean it, and convert it into a pandas DataFrame. My problem is that I want to combine them all into one DataFrame, but there are more than 200k lines. I am reading each line in as
d = '{"test1":"test2","data":{"key":{"isin":"test3"},"creationTimeStamp":1541491884194,"signal":0,"hPreds":[0,0,0,0],"bidPrice":6.413000,"preferredBidSize":1,"offerPrice":6.415000,"preferredOfferSize":1,"averageTradeSize":1029,"averageTradePrice":0.065252,"changedValues":true,"test4":10,"snapshot":false}}'
Assuming I am able to convert each line into a DataFrame, is there a way to append each line to the DataFrame very fast? Right now, with >200k lines, it takes hours to append, while reading the file itself takes less than 5 minutes.
file = 'fileName.txt'
with open(file) as f:
    content = f.readlines()
content = [x.strip() for x in content]
data = pd.DataFrame()
count = 0
for line in content:
    line = line.replace('{"string1', '')
    z = line.splitlines()
    z[0] = z[0][:-1]
    z = pd.read_json('[%s]' % ','.join(z))
    data = data.append(z)
You may check with Series:
pd.Series(d)
Out[154]:
averageTradePrice              0.065
averageTradeSize                 109
bidPrice                        6.13
changedValues                   True
creationTimeStamp           15414994
Preds                   [0, 0, 0, 0]
key                  {'epic': 'XXX'}
dataLevel                         10
offerPrice                     3.333
dtype: object
Preds' and key's values are a list and a dict; that is why, when you pass them to DataFrame, it flags:
ValueError: arrays must all be same length
Since you mention json:
from pandas.io.json import json_normalize
json_normalize(d)
Out[157]:
          Preds  averageTradePrice  ...  key.epic  offerPrice
0  [0, 0, 0, 0]              0.065  ...       XXX       3.333
[1 rows x 9 columns]
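For the 200k-line case, here is a sketch of the usual fast pattern (my suggestion, assuming each cleaned line is valid JSON): parse all lines first, then flatten once with json_normalize instead of calling DataFrame.append per line, which copies the whole frame every iteration.
import json
import pandas as pd
from pandas.io.json import json_normalize

records = []
with open('fileName.txt') as f:
    for line in f:
        # json.loads per line is cheap; the hours were spent in
        # DataFrame.append, which rebuilds the frame on every call
        records.append(json.loads(line.strip()))

# one flattening pass over all records instead of 200k appends
data = json_normalize(records)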
Pyspark Dropping the header in a dataframe, AttributeError: _jdf
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
spark = sqlContext.sparkSession
avg_calc = spark.read.csv("quiz2_algo.csv", header=True, inferSchema=True)
header = avg_calc.first()
no_header = avg_calc.subtract(header)
no_header
avg_calc contains 2 columns and I am trying to remove the 1st row from both columns; however, I am receiving the following error:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-50-24671d91e691> in <module>()
----> 1 no_header = avg_calc.subtract(header)
C:\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\dataframe.pyc in subtract(self, other)
1391
1392 """
-> 1393 return DataFrame(getattr(self._jdf, "except")(other._jdf), self.sql_ctx)
1394
1395 @since(1.4)
C:\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\types.pyc in __getattr__(self, item)
1559 raise AttributeError(item)
1560 except ValueError:
-> 1561 raise AttributeError(item)
1562
1563 def __setattr__(self, key, value):
AttributeError: _jdf
If anyone can help I would appreciate it! Example of the data:
avg_calc.show(5)
first() returns a Row object rather than a DataFrame, which is what subtract requires. See http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.first
You could try something like:
avg_calc.subtract(avg_calc.limit(1))
For example:
>>> df = spark.createDataFrame([Row(x=1), Row(x=2)])
>>> print(df.subtract(df.limit(1)).toPandas())
   x
0  2
Apply an ordering to your dataframe to ensure the row you would like dropped is in the correct location:
>>> from pyspark.sql import functions as F
>>> df = df.orderBy(F.col('CS202 Quiz#2').desc())
>>> df = df.subtract(df.limit(1))
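One caveat worth noting (my addition, not from the answer above): subtract has SQL EXCEPT DISTINCT semantics, so it also removes duplicate rows and deduplicates the result. A positional sketch that drops exactly the first row, assuming avg_calc and spark from the question:
# pair every row with its position, then keep everything past position 0
indexed = avg_calc.rdd.zipWithIndex()
rest = indexed.filter(lambda pair: pair[1] > 0).map(lambda pair: pair[0])
no_header = spark.createDataFrame(rest, avg_calc.schema)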
Setting column on empty dataframe
I'm reading JSON arrays from a text file and then creating an empty DataFrame. I want to add a new column 'id' to the empty DataFrame; 'id' comes from the JSON arrays in the text file. The error message reads "Cannot set a frame with no defined index and a value that cannot be converted to a series". I tried to overcome this error by defining the DataFrame size upfront, which did not help. Any ideas?
import json
import pandas as pd

path = 'my/path'
mydata = []
myfile = open(path, "r")
for line in myfile:
    try:
        myline = json.loads(line)
        mydata.append(myline)
    except:
        continue
mydf = pd.DataFrame()
mydf['id'] = map(lambda myline: myline['id'], mydata)
I think it is better to use:
for line in myfile:
    try:
        # extract only id to list
        myline = json.loads(line)['id']
        mydata.append(myline)
    except:
        continue

print (mydata)
[10, 5]

# create DataFrame by constructor
mydf = pd.DataFrame({'id':mydata})
print (mydf)
   id
0  10
1   5
Convert json file into csv encode/decode problems
My semester project is about classification using Naive Bayes. I've decided to use the Yelp dataset. While I was turning the JSON file into a CSV file I ran into a couple of problems, such as:
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
It's because of wrong usage of json.loads(). I tried a couple of different usages of the function to manage this part of the program. Unfortunately, none of them worked. I put my code down below; if you have any idea how to handle this, can you please explain it to me?
import json
import pandas as pd
from glob import glob
import codecs

global df
global s
global count

def convert(x):
    ob = json.loads(x)
    for k, v in ob.items():
        if isinstance(v, list):
            ob[k] = ','.join(v)
        elif isinstance(v, dict):
            for kk, vv in v.items():
                ob['%s_%s' % (k, kk)] = vv
            del ob[k]
    return ob

s = ""
count = 0
for json_filename in glob('*.json'):
    csv_filename = '%s.csv' % json_filename[:-5]
    print('Converting %s to %s' % (json_filename, csv_filename))
    with open('yelp_dataset_challenge_round9.json', 'rb') as f:  # open in binary mode
        for line in f:
            for cp in ('cp1252', 'cp850'):
                try:
                    if count is 0:
                        count = 1
                    else:
                        s = str(line.decode('utf-8'))
                except UnicodeDecodeError:
                    pass
    df = pd.DataFrame([convert(s)])
    df.to_csv(csv_filename, encoding='utf-8', index=False)
Thanks in advance :)
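No answer was captured for this question, but here is a minimal sketch of one common approach (an assumption on my part: the Yelp dump is newline-delimited JSON, one object per line, which is exactly what makes json.loads on the whole file raise "Expecting value: line 1 column 1"):
import pandas as pd

# read the file as JSON Lines and write a CSV; pandas handles the
# per-line decoding, so no manual json.loads loop is needed
df = pd.read_json('yelp_dataset_challenge_round9.json', lines=True)
df.to_csv('yelp_dataset_challenge_round9.csv', encoding='utf-8', index=False)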
How to read a pandas Series from a CSV file
I have a CSV file formatted as follows:
somefeature,anotherfeature,f3,f4,f5,f6,f7,lastfeature
0,0,0,1,1,2,4,5
And I try to read it as a pandas Series (using the pandas daily snapshot for Python 2.7). I tried the following:
import pandas as pd
types = pd.Series.from_csv('csvfile.txt', index_col=False, header=0)
and:
types = pd.read_csv('csvfile.txt', index_col=False, header=0, squeeze=True)
But both just won't work: the first one gives a random result, and the second just imports a DataFrame without squeezing.
It seems like pandas can only recognize as a Series a CSV formatted as follows:
f1, value
f2, value2
f3, value3
But when the feature keys are in the first row instead of the first column, pandas does not want to squeeze it. Is there something else I can try? Is this behaviour intended?
Here is the way I've found:
df = pandas.read_csv('csvfile.txt', index_col=False, header=0)
serie = df.ix[0,:]
Seems a bit stupid to me, as squeeze should already do this. Is this a bug or am I missing something?
EDIT: Best way to do it:
df = pandas.read_csv('csvfile.txt', index_col=False, header=0)
serie = df.transpose()[0]  # here we convert the DataFrame into a Series
This is the most stable way to get a row-oriented CSV line into a pandas Series.
BTW, the squeeze=True argument is useless for now, because as of today (April 2013) it only works with row-oriented CSV files; see the official doc: http://pandas.pydata.org/pandas-docs/dev/io.html#returning-series
This works. squeeze still works, but it just won't work alone; index_col needs to be set to zero, as below:
series = pd.read_csv('csvfile.csv', header=None, index_col=0, squeeze=True)
In [28]: df = pd.read_csv('csvfile.csv')
In [29]: df.ix[0]
Out[29]:
somefeature       0
anotherfeature    0
f3                0
f4                1
f5                1
f6                2
f7                4
lastfeature       5
Name: 0, dtype: int64
ds = pandas.read_csv('csvfile.csv', index_col=False, header=0)
X = ds.iloc[:, :10]  # ix is deprecated, use iloc
As the pandas value-selection logic is:
DataFrame -> Series = DataFrame[Column] -> Values = Series[Index]
I suggest:
df = pandas.read_csv("csvfile.csv")
s = df[df.columns[0]]
from pandas import read_csv
series = read_csv('csvfile.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)
Since none of the answers above worked for me, here is another one, recreating the Series manually from the DataFrame.
import io
import pandas as pd

# create example series
series = pd.Series([0, 1, 2], index=["a", "b", "c"])
series.index.name = "idx"
print(series)
print()

# create csv
series_csv = series.to_csv()
print(series_csv)

# read csv
df = pd.read_csv(io.StringIO(series_csv), index_col=0)
indx = df.index
vals = [df.iloc[i, 0] for i in range(len(indx))]
series_again = pd.Series(vals, index=indx)
print(series_again)
Output:
idx
a    0
b    1
c    2
dtype: int64

idx,0
a,0
b,1
c,2

idx
a    0
b    1
c    2
dtype: int64