I have collected some 12,000 tweets following the code at http://adilmoujahid.com/posts/2014/07/twitter-analytics/
The problem is that I get an error once the number of tweets gets larger; a smaller number doesn't cause this problem.
#adding columns
from pandas.io.json import json_normalize
tweets = json_normalize(tweet_data)[["text", "lang", "created_at", "user.time_zone", "user.location"]]
This gives me the following result:
AttributeError Traceback (most recent call last)
<ipython-input-21-19596361d3f0> in <module>()
1 #adding columns
2 from pandas.io.json import json_normalize
----> 3 tweets = json_normalize(tweet_data)[["text", "lang", "created_at", "user.time_zone", "user.location"]]
/usr/lib/python2.7/dist-packages/pandas/io/json.pyc in json_normalize(data, record_path, meta, meta_prefix, record_prefix)
713 # TODO: handle record value which are lists, at least error
714 # reasonably
--> 715 data = nested_to_record(data)
716 return DataFrame(data)
717 elif not isinstance(record_path, list):
/usr/lib/python2.7/dist-packages/pandas/io/json.pyc in nested_to_record(ds, prefix, level)
612
613 new_d = copy.deepcopy(d)
--> 614 for k, v in d.items():
615 # each key gets renamed with prefix
616 if level == 0:
AttributeError: 'int' object has no attribute 'items'
Is there any way to get out of this? I am a total novice at handling pandas and JSON.
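For what it's worth, the traceback suggests that tweet_data contains entries which are not dicts (an int reaches nested_to_record, which expects mappings). A minimal, hedged sketch that drops such records before normalizing, assuming tweet_data is the list built in the tutorial:
# keep only entries that are real JSON objects (dicts);
# stray ints or other scalars would break nested_to_record()
tweet_data = [t for t in tweet_data if isinstance(t, dict)]

from pandas.io.json import json_normalize
tweets = json_normalize(tweet_data)[["text", "lang", "created_at",
                                     "user.time_zone", "user.location"]]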
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
spark = sqlContext.sparkSession
avg_calc = spark.read.csv("quiz2_algo.csv", header=True, inferSchema=True)
header = avg_calc.first()
no_header = avg_calc.subtract(header)
no_header
avg_calc contains 2 columns, and I am trying to remove the 1st row from both columns; however, I am receiving the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-50-24671d91e691> in <module>()
----> 1 no_header = avg_calc.subtract(header)
C:\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\dataframe.pyc in subtract(self, other)
1391
1392 """
-> 1393 return DataFrame(getattr(self._jdf, "except")(other._jdf), self.sql_ctx)
1394
1395 @since(1.4)
C:\spark\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\types.pyc in __getattr__(self, item)
1559 raise AttributeError(item)
1560 except ValueError:
-> 1561 raise AttributeError(item)
1562
1563 def __setattr__(self, key, value):
AttributeError: _jdf
If anyone can help I would appreciate it!
Example of the data: avg_calc.show(5)
first() returns a Row object rather than a DataFrame, which is what subtract requires. See http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.first
You could try something like:
avg_calc.subtract(avg_calc.limit(1))
For example:
>>> from pyspark.sql import Row
>>> df = spark.createDataFrame([Row(x=1), Row(x=2)])
>>> print(df.subtract(df.limit(1)).toPandas())
   x
0  2
Apply an ordering to your dataframe to ensure the row you would like dropped is in the correct location:
>>> from pyspark.sql import functions as F
>>> df = df.orderBy(F.col('CS202 Quiz#2').desc())
>>> df = df.subtract(df.limit(1))
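Note that subtract is equivalent to EXCEPT DISTINCT in SQL, so it also collapses duplicate rows; if your data can contain repeats, it may drop more than just the header. A small sketch of that behavior (assuming limit(1) happens to pick the x=1 row):
>>> df = spark.createDataFrame([Row(x=1), Row(x=2), Row(x=2)])
>>> print(df.subtract(df.limit(1)).toPandas())
   x
0  2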
I am currently working my way through caffe/examples/ to learn more about caffe/pycaffe.
In the 02-fine-tuning.ipynb notebook there is a code cell which shows how to create a caffenet that takes unlabeled "dummy data" as input, allowing us to set its input images externally. The notebook can be found here:
https://github.com/BVLC/caffe/blob/master/examples/02-fine-tuning.ipynb
The code cell in question throws an error:
dummy_data = L.DummyData(shape=dict(dim=[1, 3, 227, 227]))
imagenet_net_filename = caffenet(data=dummy_data, train=False)
imagenet_net = caffe.Net(imagenet_net_filename, weights, caffe.TEST)
error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-6-9f0ecb4d95e6> in <module>()
1 dummy_data = L.DummyData(shape=dict(dim=[1, 3, 227, 227]))
----> 2 imagenet_net_filename = caffenet(data=dummy_data, train=False)
3 imagenet_net = caffe.Net(imagenet_net_filename, weights, caffe.TEST)
<ipython-input-5-53badbea969e> in caffenet(data, label, train, num_classes, classifier_name, learn_all)
68 # write the net to a temporary file and return its filename
69 with tempfile.NamedTemporaryFile(delete=False) as f:
---> 70 f.write(str(n.to_proto()))
71 return f.name
~/anaconda3/envs/testcaffegpu/lib/python3.6/tempfile.py in func_wrapper(*args, **kwargs)
481 @_functools.wraps(func)
482 def func_wrapper(*args, **kwargs):
--> 483 return func(*args, **kwargs)
484 # Avoid closing the file as long as the wrapper is alive,
485 # see issue #18879.
TypeError: a bytes-like object is required, not 'str'
Does anyone know how to do this right?
tempfile.NamedTemporaryFile() opens a file in binary mode ('w+b') by default. Since you are using Python 3.x, str is no longer the same type as in Python 2.x; passing a str to f.write() therefore results in an error, because the file expects bytes. Overriding the binary mode should avoid this error.
Replace
with tempfile.NamedTemporaryFile(delete=False) as f:
with
with tempfile.NamedTemporaryFile(delete=False, mode='w') as f:
This has been explained in a previous post:
TypeError: 'str' does not support the buffer interface
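Alternatively, you could keep the default binary mode and encode the string yourself; a minimal sketch of that variant (untested against the notebook):
with tempfile.NamedTemporaryFile(delete=False) as f:
    # encode the prototxt string to bytes for the binary-mode file
    f.write(str(n.to_proto()).encode('utf-8'))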
My dask dataframe has about 120 million rows and 4 columns:
df_final.dtypes
cust_id int64
score float64
total_qty float64
update_score float64
dtype: object
and I'm running this operation in a Jupyter notebook connected to a Linux machine:
%time df_final.to_csv('/path/claritin-files-*.csv')
and it throws this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-24-46468ae45023> in <module>()
----> 1 get_ipython().magic(u"time df_final.to_csv('path/claritin-files-*.csv')")
/home/mspra/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in magic(self, arg_s)
2334 magic_name, _, magic_arg_s = arg_s.partition(' ')
2335 magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
-> 2336 return self.run_line_magic(magic_name, magic_arg_s)
2337
2338 #-------------------------------------------------------------------------
/home/mspra/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in run_line_magic(self, magic_name, line)
2255 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
2256 with self.builtin_trap:
-> 2257 result = fn(*args,**kwargs)
2258 return result
2259
/home/mspra/anaconda2/lib/python2.7/site-packages/IPython/core/magics/execution.pyc in time(self, line, cell, local_ns)
/home/mspra/anaconda2/lib/python2.7/site-packages/IPython/core/magic.pyc in <lambda>(f, *a, **k)
191 # but it's overkill for just that one bit of state.
192 def magic_deco(arg):
--> 193 call = lambda f, *a, **k: f(*a, **k)
194
195 if callable(arg):
/home/mspra/anaconda2/lib/python2.7/site-packages/IPython/core/magics/execution.pyc in time(self, line, cell, local_ns)
1161 if mode=='eval':
1162 st = clock2()
-> 1163 out = eval(code, glob, local_ns)
1164 end = clock2()
1165 else:
<timed eval> in <module>()
/home/mspra/anaconda2/lib/python2.7/site-packages/dask/dataframe/core.pyc in to_csv(self, filename, **kwargs)
936 """ See dd.to_csv docstring for more information """
937 from .io import to_csv
--> 938 return to_csv(self, filename, **kwargs)
939
940 def to_delayed(self):
/home/mspra/anaconda2/lib/python2.7/site-packages/dask/dataframe/io/csv.pyc in to_csv(df, filename, name_function, compression, compute, get, **kwargs)
411 if compute:
412 from dask import compute
--> 413 compute(*values, get=get)
414 else:
415 return values
/home/mspra/anaconda2/lib/python2.7/site-packages/dask/base.pyc in compute(*args, **kwargs)
177 dsk = merge(var.dask for var in variables)
178 keys = [var._keys() for var in variables]
--> 179 results = get(dsk, keys, **kwargs)
180
181 results_iter = iter(results)
/home/mspra/anaconda2/lib/python2.7/site-packages/dask/threaded.pyc in get(dsk, result, cache, num_workers, **kwargs)
74 results = get_async(pool.apply_async, len(pool._pool), dsk, result,
75 cache=cache, get_id=_thread_get_id,
---> 76 **kwargs)
77
78 # Cleanup pools associated to dead threads
/home/mspra/anaconda2/lib/python2.7/site-packages/dask/async.pyc in get_async(apply_async, num_workers, dsk, result, cache, get_id, raise_on_exception, rerun_exceptions_locally, callbacks, dumps, loads, **kwargs)
491 _execute_task(task, data) # Re-execute locally
492 else:
--> 493 raise(remote_exception(res, tb))
494 state['cache'][key] = res
495 finish_task(dsk, key, state, results, keyorder.get)
ValueError: invalid literal for long() with base 10: 'total_qty'
Traceback
---------
File "/home/mspra/anaconda2/lib/python2.7/site-packages/dask/async.py", line 268, in execute_task
result = _execute_task(task, data)
File "/home/mspra/anaconda2/lib/python2.7/site-packages/dask/async.py", line 249, in _execute_task
return func(*args2)
File "/home/mspra/anaconda2/lib/python2.7/site-packages/dask/dataframe/io/csv.py", line 55, in pandas_read_text
coerce_dtypes(df, dtypes)
File "/home/mspra/anaconda2/lib/python2.7/site-packages/dask/dataframe/io/csv.py", line 83, in coerce_dtypes
df[c] = df[c].astype(dtypes[c])
File "/home/mspra/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 3054, in astype
raise_on_error=raise_on_error, **kwargs)
File "/home/mspra/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 3189, in astype
return self.apply('astype', dtype=dtype, **kwargs)
File "/home/mspra/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 3056, in apply
applied = getattr(b, f)(**kwargs)
File "/home/mspra/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 461, in astype
values=values, **kwargs)
File "/home/mspra/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 504, in _astype
values = _astype_nansafe(values.ravel(), dtype, copy=True)
File "/home/mspra/anaconda2/lib/python2.7/site-packages/pandas/types/cast.py", line 534, in _astype_nansafe
return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
File "pandas/lib.pyx", line 980, in pandas.lib.astype_intsafe (pandas/lib.c:17409)
File "pandas/src/util.pxd", line 93, in util.set_value_at_unsafe (pandas/lib.c:72777)
I have a couple of questions:
1) First of all, this export was working fine on Friday; it spit out 100 csv files (since it has 100 partitions), which I later aggregated. So what is wrong today -- anything from the error log?
2) Maybe this question is for the creators of this package: what is the most time-efficient way to get a csv extract out of a dask dataframe of this size? It took about 1.5 to 2 hours the last time it was working.
I'm not using dask distributed, and this is on a single core of a Linux cluster.
This error likely has little to do with to_csv and more to do with something else in your computation. The call to df.to_csv was just the first time you forced the computation to roll through all of the data.
Given the error, I actually suspect that this is failing in read_csv. Dask.dataframe reads the first few hundred kilobytes of your first file to guess at the datatypes, but it seems to have guessed incorrectly. You might want to try specifying dtypes explicitly in the read_csv call.
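For example, a sketch of pinning the dtypes up front (the dtypes are taken from your df_final.dtypes listing; the input path is hypothetical, substitute your own):
import dask.dataframe as dd

# force the column dtypes instead of letting dask sample and guess
df_final = dd.read_csv('/path/claritin-input-*.csv',  # hypothetical path
                       dtype={'cust_id': 'int64',
                              'score': 'float64',
                              'total_qty': 'float64',
                              'update_score': 'float64'})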
With regard to the second question about writing to CSV quickly, my first answer would be "use Parquet or HDF5 instead". They're much faster and more accurate in almost every respect.
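A minimal sketch of both options with a reasonably recent dask (output paths are hypothetical; Parquet needs fastparquet or pyarrow installed):
# one Parquet dataset instead of 100 csv files
df_final.to_parquet('/path/claritin-files.parquet')
# or HDF5, one file per partition via the * pattern
df_final.to_hdf('/path/claritin-files-*.hdf', '/data')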
I am trying to import a list of stock tickers (the commented-out #symbols_list = ... read_csv ... line) and fetch stock info from that date onward into a pandas DataFrame.
import datetime
import pandas as pd
from pandas import DataFrame
from pandas.io.data import DataReader
#symbols_list = [pd.read_csv('Stock List.csv', index_col=0)]
symbols_list = ['AAPL', 'TSLA', 'YHOO','GOOG', 'MSFT','ALTR','WDC','KLAC']
symbols=[]
start = datetime.datetime(2014, 2, 9)
#end = datetime.datetime(2014, 12, 30)
for ticker in symbols_list:
    r = DataReader(ticker, "yahoo",
                   start=start)
    #start=start, end)
    # add a symbol column
    r['Symbol'] = ticker
    symbols.append(r)
# concatenate all the dfs
df = pd.concat(symbols)
#define cell with the columns that i need
cell= df[['Symbol','Open','High','Low','Adj Close','Volume']]
#changing sort of Symbol (ascending) and Date(descending) setting Symbol as first column and changing date format
cell.reset_index().sort(['Symbol', 'Date'], ascending=[1,0]).set_index('Symbol').to_csv('stock.csv', date_format='%d/%m/%Y')
The input file Stock list.csv
has the following content, with each entry on its own row:
Index
MMM
ABT
ABBV
ACE
ACN
ACT
ADBE
ADT
AES
AET
AFL
AMG
and many more tickers of interest.
When run with the manually coded list
symbols_list = ['AAPL', 'TSLA', 'YHOO','GOOG', 'MSFT','ALTR','WDC','KLAC']
it all works fine: it processes the input and stores it to a file.
But whenever I run the code with the read_csv from file, I get the following error:
runfile('Z:/python/CrystallBall/SpyderProject/getstocks3.py', wdir='Z:/python/CrystallBall/SpyderProject')
Reloaded modules: pandas.io.data, pandas.tseries.common
Traceback (most recent call last):
File "<ipython-input-32-67cbdd367f48>", line 1, in <module>
runfile('Z:/python/CrystallBall/SpyderProject/getstocks3.py', wdir='Z:/python/CrystallBall/SpyderProject')
File "C:\Program Files (x86)\WinPython-32bit-3.4.2.4\python-3.4.2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 601, in runfile
execfile(filename, namespace)
File "C:\Program Files (x86)\WinPython-32bit-3.4.2.4\python-3.4.2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 80, in execfile
exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
File "Z:/python/CrystallBall/SpyderProject/getstocks3.py", line 35, in <module>
cell.reset_index().sort(['Symbol', 'Date'], ascending=[1,0]).set_index('Symbol').to_csv('stock.csv', date_format='%d/%m/%Y')
File "C:\Users\Morten\AppData\Roaming\Python\Python34\site-packages\pandas\core\generic.py", line 1947, in __getattr__
(type(self).__name__, name))
AttributeError: 'Panel' object has no attribute 'reset_index'
Why can I only process the manually laid-out symbols_list, and not the tickers imported from the file?
Any takers? Any help greatly appreciated!
Your code has several issues; the following code fixes them and works:
In [4]:
import datetime
import io
import pandas as pd
from pandas import DataFrame
from pandas.io.data import DataReader
temp='''Index
MMM
ABT
ABBV
ACE
ACN
ACT
ADBE
ADT
AES
AET
AFL
AMG'''
df = pd.read_csv(io.StringIO(temp), index_col=[0])
symbols=[]
start = datetime.datetime(2014, 2, 9)
for ticker in df.index:
    r = DataReader(ticker, "yahoo",
                   start=start)
    #start=start, end)
    # add a symbol column
    r['Symbol'] = ticker
    symbols.append(r)
# concatenate all the dfs
df = pd.concat(symbols)
#define cell with the columns that i need
cell= df[['Symbol','Open','High','Low','Adj Close','Volume']]
#changing sort of Symbol (ascending) and Date(descending) setting Symbol as first column and changing date format
cell.reset_index().sort(['Symbol', 'Date'], ascending=[1,0]).set_index('Symbol').to_csv('stock.csv', date_format='%d/%m/%Y')
cell
Out[4]:
Symbol Open High Low Adj Close Volume
Date
2014-02-10 MMM 129.65 130.41 129.02 126.63 3317400
2014-02-11 MMM 129.70 131.49 129.65 127.88 2604000
... ... ... ... ... ... ...
2015-02-06 AMG 214.35 215.82 212.64 214.45 424400
[3012 rows x 6 columns]
So firstly this: symbols_list = [pd.read_csv('Stock List.csv', index_col=0)]
This will create a list with a single entry which will be a df with no columns and just an index of your ticker values.
This: for ticker in symbols_list:
won't work because iterating over a DataFrame yields the column labels, not the rows; in your case you need to iterate over the index, which is what my code does.
I'm not sure what you wanted to achieve; it isn't necessary to specify index_col=0 if there is only one column. You can either create a df with just a single column, or, if you pass squeeze=True, get a Series of the values instead.
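For example, a short sketch of the squeeze variant (using the file name from the question; squeeze was a valid read_csv keyword in the pandas of this era):
# one-column csv -> Series of ticker strings
symbols_list = pd.read_csv('Stock List.csv', squeeze=True)
for ticker in symbols_list:
    print(ticker)  # 'MMM', 'ABT', ...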
I'm trying to read back a dataframe created via df.to_json() using pd.read_json, but I'm getting a ValueError. I think it may have to do with the fact that the index is a MultiIndex, but I'm not sure how to deal with that.
The original dataframe of 55k rows is called psi and I created test.json via:
psi.head().to_json('test.json')
Here is the output of print psi.head().to_string() if you want to use that.
When I do it on this small set of data (5 rows), I get a ValueError.
! wget --no-check-certificate https://gist.githubusercontent.com/olgabot/9897953/raw/c270d8cf1b736676783cc1372b4f8106810a14c5/test.json
import pandas as pd
pd.read_json('test.json')
Here's the full stack:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-1de2f0e65268> in <module>()
1 get_ipython().system(u' wget https://gist.githubusercontent.com/olgabot/9897953/raw/c270d8cf1b736676783cc1372b4f8106810a14c5/test.json')
2 import pandas as pd
----> 3 pd.read_json('test.json')
/home/obot/virtualenvs/envy/lib/python2.7/site-packages/pandas/io/json.pyc in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit)
196 obj = FrameParser(json, orient, dtype, convert_axes, convert_dates,
197 keep_default_dates, numpy, precise_float,
--> 198 date_unit).parse()
199
200 if typ == 'series' or obj is None:
/home/obot/virtualenvs/envy/lib/python2.7/site-packages/pandas/io/json.pyc in parse(self)
264
265 else:
--> 266 self._parse_no_numpy()
267
268 if self.obj is None:
/home/obot/virtualenvs/envy/lib/python2.7/site-packages/pandas/io/json.pyc in _parse_no_numpy(self)
481 if orient == "columns":
482 self.obj = DataFrame(
--> 483 loads(json, precise_float=self.precise_float), dtype=None)
484 elif orient == "split":
485 decoded = dict((str(k), v)
ValueError: No ':' found when decoding object value
> /home/obot/virtualenvs/envy/lib/python2.7/site-packages/pandas/io/json.py(483)_parse_no_numpy()
482 self.obj = DataFrame(
--> 483 loads(json, precise_float=self.precise_float), dtype=None)
484 elif orient == "split":
But when I do it on the whole dataframe (55k rows), I get an invalid pointer error and the IPython kernel dies. Any ideas?
EDIT: added how the json was generated in the first place.
This is not implemented ATM, see the issue here: https://github.com/pydata/pandas/issues/4889.
You can simply reset the index first, e.g.
df.reset_index().to_json(...)
and it will work.
Or you can just write the json with orient='table':
df.to_json(path_or_buf='test.json', orient='table')
and read the MultiIndex json back with:
pd.read_json('test.json', orient='table')
If you want to restore the MultiIndex structure:
# save MultiIndex indexes names
indexes_names = df.index.names
df.reset_index().to_json('dump.json')
# return back MultiIndex structure:
loaded_df = pd.read_json('dump.json').set_index(indexes_names)
This was my simple dirty fix for encoding/decoding a MultiIndex pandas dataframe, which seems to also work for datetimes in the index/columns... not optimized!
Here is the encoder to json - I encode the dataframe, index and columns into a dict to create the json:
import json
import pandas as pd
def to_json_multiindex(df):
    dfi = df.index.to_frame()
    dfc = df.columns.to_frame()
    d = dict(
        df=df.to_json(),
        di=dfi.to_json(),
        dc=dfc.to_json()
    )
    return json.dumps(d)
And here is the decoder, which reads the json dict and re-creates the dataframe:
def read_json_multiindex(j):
    d = json.loads(j)
    di = pd.read_json(d['di'])
    if di.shape[1] > 1:
        di = pd.MultiIndex.from_frame(di)
    else:
        _name = di.columns[0]
        di = di.index
        di.name = _name
    dc = pd.read_json(d['dc'])
    if dc.shape[1] > 1:
        dc = pd.MultiIndex.from_frame(dc)
    else:
        _name = dc.columns[0]
        dc = dc.index
        dc.name = _name
    df = pd.read_json(d['df']).values
    return pd.DataFrame(
        data=df,
        index=di,
        columns=dc,
    )
And here is a test for MultiIndex columns and index... it seems to preserve the dataframe. A couple of issues: 1) it's probably inefficient, and 2) it doesn't seem to work for datetimes in a MultiIndex (but works when the index isn't a MultiIndex):
df = pd.DataFrame(
    data=[[0, 1, 2], [2, 3, 4], [5, 6, 7]],
    index=pd.MultiIndex.from_tuples(
        (('aa', 'bb'), ('aa', 'cc'), ('bb', 'cc')),
        names=['AA', 'BB']),
    columns=pd.MultiIndex.from_tuples(
        (('XX', 'YY'), ('XX', 'ZZ'), ('YY', 'ZZ')),
        names=['YY', 'ZZ'])
)
j = to_json_multiindex(df)
d = read_json_multiindex(j)