Pyarrow table memory compared to raw csv size - pyarrow

I have a 2GB CSV file that I read into a pyarrow table with the following:
from pyarrow import csv
tbl = csv.read_csv(path)
When I call tbl.nbytes I get 3.4GB. I was surprised at how much larger the data is in Arrow memory than as a CSV. Maybe I have a fundamental misunderstanding of what pyarrow is doing under the hood, but I thought that, if anything, it would be smaller due to its columnar nature (I also probably could have squeezed out more gains using ConvertOptions, but I wanted a baseline). I definitely wasn't expecting an increase of almost 75%. Also, when I convert the Arrow table to a pandas df, the df takes up roughly the same amount of memory as the CSV, which was expected.
Can anyone help explain the difference in memory for Arrow tables compared to a CSV / pandas df?
Thanks.
UPDATE
Full code and output below.
In [2]: csv.read_csv(r"C:\Users\matth\OneDrive\Data\Kaggle\sf-bay-area-bike-share\status.csv")
Out[2]:
pyarrow.Table
station_id: int64
bikes_available: int64
docks_available: int64
time: string
In [3]: tbl = csv.read_csv(r"C:\Users\generic\OneDrive\Data\Kaggle\sf-bay-area-bike-share\status.csv")
In [4]: tbl.schema
Out[4]:
station_id: int64
bikes_available: int64
docks_available: int64
time: string
In [5]: tbl.nbytes
Out[5]: 3419272022
In [6]: tbl.to_pandas().info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71984434 entries, 0 to 71984433
Data columns (total 4 columns):
 #   Column           Dtype
---  ------           -----
 0   station_id       int64
 1   bikes_available  int64
 2   docks_available  int64
 3   time             object
dtypes: int64(3), object(1)
memory usage: 2.1+ GB

There are two problems:
The integer columns are using int64, but int32 would be more appropriate (unless the values are very large).
The time column is interpreted as a string. It doesn't help that the input doesn't follow a standard format (%Y/%m/%d %H:%M:%S).
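To see how the 3.4 GB splits across columns, you can check each column's size directly (nbytes is available per column as well as per table); the int64 columns cost 8 bytes per row, and the string column stores both the characters and per-value offsets:
for name in tbl.column_names:
    print(name, tbl.column(name).nbytes)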
The first problem is easy to solve using ConvertOptions:
import pyarrow as pa
from pyarrow import csv

tbl = csv.read_csv(
    <path>,
    convert_options=csv.ConvertOptions(
        column_types={
            'station_id': pa.int32(),
            'bikes_available': pa.int32(),
            'docks_available': pa.int32(),
            'time': pa.string()
        }))
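With 71,984,434 rows, switching the three integer columns from int64 to int32 should save roughly 3 × 4 bytes × 71,984,434 ≈ 0.86 GB on its own.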
The second one is a bit more complicated because, as far as I can tell, the read_csv API doesn't let you provide a format for the time column, and there's no easy way to convert string columns to datetime in pyarrow, so you have to use pandas instead:
import pandas as pd

series = tbl.column('time').to_pandas()
series_as_datetime = pd.to_datetime(series, format='%Y/%m/%d %H:%M:%S')
tbl2 = pa.table(
    {
        'station_id': tbl.column('station_id'),
        'bikes_available': tbl.column('bikes_available'),
        'docks_available': tbl.column('docks_available'),
        'time': pa.chunked_array([series_as_datetime])
    })
tbl2.nbytes
>>> 1475683759
1475683759 bytes is about what you'd expect, and you can't do much better: each row is 20 bytes (4 + 4 + 4 + 8).
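As a hedged aside (not part of the original answer): newer pyarrow releases expose a timestamp_parsers option on ConvertOptions, which may let read_csv parse the time column directly without the pandas round-trip; check the docs for your pyarrow version:
import pyarrow as pa
from pyarrow import csv

tbl = csv.read_csv(
    <path>,  # same CSV path as above
    convert_options=csv.ConvertOptions(
        column_types={
            'station_id': pa.int32(),
            'bikes_available': pa.int32(),
            'docks_available': pa.int32(),
            'time': pa.timestamp('s'),
        },
        timestamp_parsers=['%Y/%m/%d %H:%M:%S'],  # format assumed from the question
    ))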

Parse nested json data in dataframe

I have a delimited file that also contains JSON and key=value pairs in one of the columns. I need to parse this data into a dataframe.
Below is the record format:
**trx_id|name|service_context|status**
abc123|order|type=cdr;payload={"trx_id":"abc123","name":"abs","counter":[{"counter_type":"product"},{"counter_type":"transfer"}],"language":"id","type":"AD","can_replace":"yes","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period":"0","period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0}}]}}],"renewal_flag":"0"}|success
abc456|order|type=cdr;payload={"trx_id":"abc456","name":"abs","counter":[{"counter_type":"product"}],"language":"id","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0},"bt":{"service_id":"500_USSD","amount":"65000"}}]}}],"renewal_flag":"1"}|success
I need to convert all the information from these records into this format:
trx_id|name |type|payload.trx_id|payload.name|payload.counter.counter_type|payload.counter.counter_info|.....|payload.renewal.flag|status
abc123|order|cdr |abc123 |abs |product |transfer |.....|0 |success
abc456|order|cdr |abc456 |abs |product | |.....|1 |success
Currently I've manually parsed the key=value data with sep=';|[|]', stripped everything before the '=', and updated the column names.
For the JSON, I run the command below; however, the result replaces the existing table and only contains the parsed JSON.
test_parse = pd.concat([pd.json_normalize(json.loads(js)) for js in test_parse['payload']])
Is there any way to avoid this manual processing for this type of data?
The hint below should be sufficient to solve the problem.
Do it part-wise for each column and then merge the pieces together (you will need to remove the original columns once you are able to split them into multiple columns; see the merge sketch after the snippet):
import ast
import pandas as pd
from pandas.io.json import json_normalize

x = json_normalize(df3['service_context'].apply(lambda x: ast.literal_eval(x.split('=')[1]))).add_prefix('payload.')
y = pd.DataFrame(x['payload.counter'].apply(lambda x: [i['counter_type'] for i in x]).to_list())
y = y.rename(columns={0: 'counter_type', 1: 'counter_info'})
for row in x['payload.product']:
    z1 = json_normalize(row)
    z2 = json_normalize(z1['customer_spec.resource_pecification'][0])
    ### Write your own code.
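To act on the "merge them together and remove the columns" part of the hint, here is a minimal sketch (my addition, assuming the df3, x and y frames from the snippet above):
# Hypothetical continuation: attach the normalized pieces and drop the parsed column.
# The remaining list-valued columns (e.g. 'payload.product') still need expanding.
df_merged = pd.concat([df3.drop(columns=['service_context']), x, y], axis=1)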
It's really a 3-step approach:
split on the primary pipe (|) delimiter
extract the key/value pairs
normalize the JSON
import pandas as pd
import io, json

# overall data structure is pipe delimited
df = pd.read_csv(io.StringIO("""abc123|order|type=cdr;payload={"trx_id":"abc123","name":"abs","counter":[{"counter_type":"product"},{"counter_type":"transfer"}],"language":"id","type":"AD","can_replace":"yes","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period":"0","period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0}}]}}],"renewal_flag":"0"}|success
abc456|order|type=cdr;payload={"trx_id":"abc456","name":"abs","counter":[{"counter_type":"product"}],"language":"id","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0},"bt":{"service_id":"500_USSD","amount":"65000"}}]}}],"renewal_flag":"1"}|success"""),
                 sep="|", header=None, names=["trx_id", "name", "data", "status"])

df2 = pd.concat([
    df,
    # split out the ";" delimited sub-columns in the 3rd column
    pd.DataFrame(
        [[c.split("=")[1] for c in r] for r in df.data.str.split(";")],
        columns=[c.split("=")[0] for c in df.data.str.split(";")[0]],
    )
], axis=1)

# extract the json payload into columns. This will leave embedded lists, as these are many-to-many
# relationships that need to be worked out by the data owner
df3 = pd.concat([df2,
                 pd.concat([pd.json_normalize(json.loads(p)).add_prefix("payload.") for p in df2.payload]).reset_index()], axis=1)
Output:
trx_id name data status type payload index payload.trx_id payload.name payload.counter payload.language payload.type payload.can_replace payload.product payload.renewal_flag payload.price.transaction payload.price.discount
0 abc123 order type=cdr;payload={"trx_id":"abc123","name":"ab... success cdr {"trx_id":"abc123","name":"abs","counter":[{"c... 0 abc123 abs [{'counter_type': 'product'}, {'counter_type':... id AD yes [{'flag': '0', 'identifier_flag': '0', 'custom... 0 1800 0
Use with caution - explode() the embedded lists:
df3p = df3["payload.product"].explode().apply(pd.Series)
df3.join(df3.explode("payload.counter")["payload.counter"].apply(pd.Series)).join(
    pd.json_normalize(df3p.join(df3p["customer_spec"].apply(pd.Series)).explode("resource_pecification").to_dict(orient="records"))
)

Pytorch: Overfitting on a small batch: Debugging

I am building a multi-class image classifier.
There is a debugging trick to overfit on a single batch to check if there any deeper bugs in the program.
How can I structure the code so that this check can be done in a portable way?
One arduous and not very smart way is to build a hold-out train/test folder for a small batch, where the test set consists of two distributions - seen and unseen data - and if the model performs well on the seen data and poorly on the unseen data, we can conclude that the network doesn't have any deeper structural bug.
But this does not seem like a smart or portable way, and it has to be redone for every problem.
Currently, I have a dataset class where I am partitioning the data into train/dev/test in the following way:
import pandas as pd
from sklearn.model_selection import train_test_split


def split_equal_into_val_test(csv_file=None, stratify_colname='y',
                              frac_train=0.6, frac_val=0.15, frac_test=0.25,
                              ):
    """
    Split a Pandas dataframe into three subsets (train, val, and test).

    Following fractional ratios provided by the user, where val and
    test set have the same number of each class while the train set has
    the remaining number of left classes.

    Parameters
    ----------
    csv_file : Input data csv file to be passed
    stratify_colname : str
        The name of the column that will be used for stratification. Usually
        this column would be for the label.
    frac_train : float
    frac_val : float
    frac_test : float
        The ratios with which the dataframe will be split into train, val, and
        test data. The values should be expressed as float fractions and should
        sum to 1.0.
    random_state : int, None, or RandomStateInstance
        Value to be passed to train_test_split().

    Returns
    -------
    df_train, df_val, df_test :
        Dataframes containing the three splits.
    """
    df = pd.read_csv(csv_file).iloc[:, 1:]
    if frac_train + frac_val + frac_test != 1.0:
        raise ValueError('fractions %f, %f, %f do not add up to 1.0' %
                         (frac_train, frac_val, frac_test))
    if stratify_colname not in df.columns:
        raise ValueError('%s is not a column in the dataframe' %
                         (stratify_colname))

    df_input = df
    no_of_classes = 4
    sfact = int((0.1 * len(df)) / no_of_classes)

    # Shuffling the data frame
    df_input = df_input.sample(frac=1)

    df_temp_1 = df_input[df_input['labels'] == 1][:sfact]
    df_temp_2 = df_input[df_input['labels'] == 2][:sfact]
    df_temp_3 = df_input[df_input['labels'] == 3][:sfact]
    df_temp_4 = df_input[df_input['labels'] == 4][:sfact]

    dev_test_df = pd.concat([df_temp_1, df_temp_2, df_temp_3, df_temp_4])
    dev_test_y = dev_test_df['labels']

    # Split the temp dataframe into val and test dataframes.
    df_val, df_test, dev_Y, test_Y = train_test_split(
        dev_test_df, dev_test_y,
        stratify=dev_test_y,
        test_size=0.5,
        )

    df_train = df[~df['img'].isin(dev_test_df['img'])]

    assert len(df_input) == len(df_train) + len(df_val) + len(df_test)

    return df_train, df_val, df_test


def train_val_to_ids(train, val, test, stratify_columns='labels'):  # noqa
    """
    Convert the stratified dataset into a dictionary: partition['train'] and labels.

    To generate the parallel code according to https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel

    Parameters
    -----------
    csv_file : Input data csv file to be passed
    stratify_columns : The label column

    Returns
    -----------
    partition, labels:
        partition dictionary containing train and validation ids and label dictionary containing ids and their labels  # noqa
    """
    train_list, val_list, test_list = train['img'].to_list(), val['img'].to_list(), test['img'].to_list()  # noqa
    partition = {"train_set": train_list,
                 "val_set": val_list,
                 }
    labels = dict(zip(train.img, train.labels))
    labels.update(dict(zip(val.img, val.labels)))
    return partition, labels
P.S. - I know about PyTorch Lightning and that it has an overfitting feature which can be used easily, but I don't want to move to PyTorch Lightning.
I don't know how portable it will be, but a trick that I use is to modify the __len__ function in the Dataset.
If I modified it from
def __len__(self):
    return len(self.data_list)
to
def __len__(self):
    return 20
It will only output the first 20 elements of the dataset (regardless of shuffle). You only need to change one line of code, and the rest should work just fine, so I think it's pretty neat.
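A related, arguably more portable variant (my sketch, not part of the original answer): wrap the dataset in torch.utils.data.Subset so the Dataset class itself stays untouched:
from torch.utils.data import DataLoader, Subset

# full_dataset is whatever Dataset instance you already have (hypothetical name)
small_ds = Subset(full_dataset, list(range(20)))   # keep only the first 20 samples
loader = DataLoader(small_ds, batch_size=20, shuffle=True)
# Train on this loader for many epochs; the loss should approach ~0 if the pipeline is sound.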

Write very big DataFrame into text file or split Dataframe

I have a dataframe whose shape is (4255300, 10). I have to open it as a CSV file, but due to the size restrictions of Excel this is not possible.
I tried to split the df row-wise (Pandas: split dataframe into multiple dataframes by number of rows), but only index numbers end up in the splits (I wrote those splits to csv files).
I also tried to write this df to a text file (np.savetxt('desktop/s2.txt', z.values, fmt='%d', delimiter="\t")), but wrong data ends up in the text file.
There is no issue with the width of the df; the only problem is its length, i.e. the number of rows.
Can anyone help me with this?
You could split the DataFrame into smaller chunks and then export like this:
import numpy as np
import pandas as pd

# Creating a DataFrame with some numbers
df = pd.DataFrame(np.random.randint(0, 100, size=(42000, 10)), index=np.arange(0, 42000)).reset_index()
# Setting my chunk size
chunk_size = 10000
# Assigning chunk numbers to rows
df['chunk'] = df['index'].apply(lambda x: int(x / chunk_size))
# We don't want the 'chunk' and 'index' columns in the output
cols = [col for col in df.columns if col not in ['chunk', 'index']]
# groupby chunk and export each chunk to a different csv.
i = 0
for _, chunk in df.groupby('chunk'):
    chunk[cols].to_csv(f'chunk{i}.csv', index=False)
    i += 1
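A simpler variant of the same idea (my sketch, assuming only pandas and numpy): slice by position instead of adding helper columns:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(42000, 10)))
chunk_size = 10000
for i in range(0, len(df), chunk_size):
    # each slice is written to its own csv, small enough for Excel
    df.iloc[i:i + chunk_size].to_csv(f'chunk{i // chunk_size}.csv', index=False)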

Reading NaN values from .csv files with decode_csv()

I have a .csv file with integer values that can contain an NA value, which represents missing data.
Example file:
-9882,-9585,-9179
-9883,-9587,NA
-9882,-9585,-9179
When trying to read it with
import tensorflow as tf
reader = tf.TextLineReader(skip_header_lines=1)
key, value = reader.read_up_to(filename_queue, 1)
record_defaults = [[0], [0], [0]]
data, ABL_E, ABL_N = tf.decode_csv(value, record_defaults=record_defaults)
It throws the following error later, on sess.run(_), on the 2nd iteration:
InvalidArgumentError (see above for traceback): Field 5 in record 32400 is not a valid int32: NA
Is there a way to interpret the string "NA" as NaN or a similar value while reading a CSV in TensorFlow?
I recently ran into the same problem. I solved it by reading the CSV as strings, replacing every occurrence of "NA" with some valid value, and then converting to float:
# Set up reading from CSV files
filename_queue = tf.train.string_input_producer([filename])
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
NUM_COLUMNS = XX # Specify number of expected columns
# Read values as string, set "NA" for missing values.
record_defaults = [[tf.cast("NA", tf.string)]] * NUM_COLUMNS
decoded = tf.decode_csv(value, record_defaults=record_defaults, field_delim="\t")
# Replace every occurrence of "NA" with "-1"
no_nan = tf.where(tf.equal(decoded, "NA"), ["-1"]*NUM_COLUMNS, decoded)
# Convert to float, combine to a single tensor with stack.
float_row = tf.stack(tf.string_to_number(no_nan, tf.float32))
But in the long term I plan to switch to TFRecords, because reading CSV is too slow for my needs.
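A hedged note for newer TensorFlow versions (not part of the original answer): the tf.data CSV reader takes an na_value argument, which may avoid the string round-trip entirely; the filename, column layout, and exact record_defaults form below are assumptions to verify against your TF version:
import tensorflow as tf

dataset = tf.data.experimental.CsvDataset(
    ["data.csv"],                        # hypothetical filename
    record_defaults=[[-1], [-1], [-1]],  # per-column defaults substituted for NA fields
    na_value="NA")                       # treat "NA" as a missing value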

How to read a pandas Series from a CSV file

I have a CSV file formatted as follows:
somefeature,anotherfeature,f3,f4,f5,f6,f7,lastfeature
0,0,0,1,1,2,4,5
I am trying to read it as a pandas Series (using a pandas daily snapshot for Python 2.7).
I tried the following:
import pandas as pd
types = pd.Series.from_csv('csvfile.txt', index_col=False, header=0)
and:
types = pd.read_csv('csvfile.txt', index_col=False, header=0, squeeze=True)
But both just won't work: the first one gives a random result, and the second one just imports a DataFrame without squeezing it.
It seems like pandas can only recognize as a Series a CSV formatted as follows:
f1, value
f2, value2
f3, value3
But when the feature keys are in the first row instead of the first column, pandas does not want to squeeze it.
Is there something else I can try? Is this behaviour intended?
Here is the way I've found:
df = pandas.read_csv('csvfile.txt', index_col=False, header=0);
serie = df.ix[0,:]
Seems a bit silly to me, as squeeze should already do this. Is this a bug, or am I missing something?
EDIT: Best way to do it:
df = pandas.read_csv('csvfile.txt', index_col=False, header=0);
serie = df.transpose()[0] # here we convert the DataFrame into a Series
This is the most stable way to get a row-oriented CSV line into a pandas Series.
BTW, the squeeze=True argument is useless for now, because as of today (April 2013) it only works with row-oriented CSV files, see the official doc:
http://pandas.pydata.org/pandas-docs/dev/io.html#returning-series
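For completeness, the same row-to-Series trick with the non-deprecated indexer (ix was later removed from pandas); a minimal sketch:
import pandas as pd

df = pd.read_csv('csvfile.txt', header=0)
series = df.iloc[0]   # first (and only) data row as a Series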
This works. squeeze still works, but it won't work alone; index_col needs to be set to zero, as below:
series = pd.read_csv('csvfile.csv', header = None, index_col = 0, squeeze = True)
In [28]: df = pd.read_csv('csvfile.csv')
In [29]: df.ix[0]
Out[29]:
somefeature 0
anotherfeature 0
f3 0
f4 1
f5 1
f6 2
f7 4
lastfeature 5
Name: 0, dtype: int64
ds = pandas.read_csv('csvfile.csv', index_col=False, header=0);
X = ds.iloc[:, :10] #ix deprecated
As the pandas value-selection logic is:
DataFrame -> Series = DataFrame[Column] -> Values = Series[Index]
I suggest:
df=pandas.read_csv("csvfile.csv")
s=df[df.columns[0]]
from pandas import read_csv
series = read_csv('csvfile.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)
Since none of the answers above worked for me, here is another one, recreating the Series manually from the DataFrame.
import io
import pandas as pd

# create example series
series = pd.Series([0, 1, 2], index=["a", "b", "c"])
series.index.name = "idx"
print(series)
print()
# create csv
series_csv = series.to_csv()
print(series_csv)
# read csv
df = pd.read_csv(io.StringIO(series_csv), index_col=0)
indx = df.index
vals = [df.iloc[i, 0] for i in range(len(indx))]
series_again = pd.Series(vals, index=indx)
print(series_again)
Output:
idx
a 0
b 1
c 2
dtype: int64
idx,0
a,0
b,1
c,2
idx
a 0
b 1
c 2
dtype: int64