Memory error when using read_csv - csv

I'd like to convert CSV files into HDF5 format for use in Caffe training. Because the CSV file is 80 GB, I get a memory error, even though the machine has 128 GB of RAM. Is it possible to improve my code, e.g. to handle the file piece by piece? Below is my code; the memory error is raised at the np.array call.
import math
import os
import sys

import h5py
import numpy as np
import pandas as pd

if '__main__' == __name__:
    print 'Loading...'
    day = sys.argv[1]
    file = day + ".xls"
    data = pd.read_csv(file, header=None)
    print data.iloc[0, 1:5]
    y = np.array(data.iloc[:, 0], np.float32)
    x = np.array(data.iloc[:, 1:], np.float32)
    patch = 100000
    dirname = "hdf5_" + day
    os.mkdir(dirname)
    filename = dirname + "/hdf5.txt"
    modelname = dirname + "/data"
    file_w = open(filename, 'w')
    for idx in range(int(math.ceil(y.shape[0] * 1.0 / patch))):
        with h5py.File(modelname + str(idx) + '.h5', 'w') as f:
            d_begin = idx * patch
            d_end = min(y.shape[0], (idx + 1) * patch)
            f['data'] = x[d_begin:d_end, :]
            f['label'] = y[d_begin:d_end]
        file_w.write(modelname + str(idx) + '.h5\n')
    file_w.close()

The best approach would be to read n lines at a time and write them to the HDF5 file, extending it by n elements each time. That way the amount of memory needed does not depend on the size of the CSV file. You could read a line at a time as well, but that would be slightly less efficient.
Here's code that applies this process for reading weather station data:
https://github.com/HDFGroup/datacontainer/blob/master/util/ghcn/convert_ghcn.py.
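A minimal sketch of that pattern (distinct from the linked script), assuming the layout from the question (label in column 0, features in the remaining columns) and placeholder file names:
import h5py
import numpy as np
import pandas as pd

# Stream the CSV in chunks and append to resizable HDF5 datasets,
# so memory use stays bounded by the chunk size.
chunk_rows = 100000
with h5py.File('data.h5', 'w') as f:
    data_ds = None
    label_ds = None
    for chunk in pd.read_csv('input.csv', header=None, chunksize=chunk_rows):
        y = np.asarray(chunk.iloc[:, 0], dtype=np.float32)
        x = np.asarray(chunk.iloc[:, 1:], dtype=np.float32)
        if data_ds is None:
            # Create datasets with an unlimited first dimension so they can grow.
            data_ds = f.create_dataset('data', data=x, maxshape=(None, x.shape[1]))
            label_ds = f.create_dataset('label', data=y, maxshape=(None,))
        else:
            old = data_ds.shape[0]
            data_ds.resize(old + x.shape[0], axis=0)
            label_ds.resize(old + y.shape[0], axis=0)
            data_ds[old:] = x
            label_ds[old:] = y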

Actually, since you treat chunks of size 100000 separately, there is no need to load the whole CSV at once. The chunksize option of read_csv is exactly for this case.
When you specify chunksize, read_csv returns an iterator that yields DataFrames of chunksize rows each. You can iterate over it instead of slicing the arrays yourself.
Minus all the lines setting the different variables, your code should look more like this:
chunks = pd.read_csv(file, header=None, chunksize=100000)
file_w = open(filename, 'w')
for chunk_number, data in enumerate(chunks):
    y = np.array(data.iloc[:, 0], np.float32)
    x = np.array(data.iloc[:, 1:], np.float32)
    with h5py.File(modelname + str(chunk_number) + '.h5', 'w') as f:
        f['data'] = x
        f['label'] = y
    file_w.write(modelname + str(chunk_number) + '.h5\n')
file_w.close()

Related

Loading multiple csvs with mixed dtypes in tensorflow for training

I have 100s of CSVs in a directory, all with headers. I am trying to create a feedforward NN for regression using TensorFlow.
What's the best way to import these CSVs and train on them with tf?
Could you also check whether my preprocessing is done right?
Note: my features have mixed datatypes (int, float, string); my target is a float.
I cannot concatenate the CSVs and import them with pandas: my data is >50 GB, so it cannot be loaded in memory and has to be read iteratively from disk.
Directory Path:
./data/train/ -> 100s of csvs
./data/test -> 100s of csvs
./data/valid -> 100s of csvs
Methodology:
Create a generator
Use the Dataset API to load the data
Preprocess the data (embedding, one-hot, etc.)
Train/fit
But in the generator I was only able to specify output formats where the inputs/outputs have homogeneous dtypes.
Code:
def data_generator(file_list, batch_size=2):
    i = 0
    while True:  # This loop is used to run the generator indefinitely.
        if i * batch_size >= len(file_list):
            i = 0
            np.random.shuffle(file_list)
        else:
            file_chunk = file_list[i * batch_size:(i + 1) * batch_size]
            data = []
            labels = []
            for file in file_chunk:
                temp = pd.read_csv(open(file, 'r'))  # Change this line to read any other type of file
                label = temp.pop('ACTUAL_BOXES')     # Pop the target column off the features
                data.append(temp.values)             # Convert column data to matrix-like data with one channel
                labels.append(label.values)
            data = np.asarray(data)
            labels = np.asarray(labels)
            yield data, labels  # Here data will be mixed-dtype arrays & labels will be a float-dtype array
            i = i + 1

# Getting the list of files inside the directories
train_file_list = np.sort(glob.glob('././data/train/*.csv'))
test_file_list = np.sort(glob.glob('././data/test/*.csv'))
val_file_list = np.sort(glob.glob('././data/val/*.csv'))

train_dataset = tf.data.Dataset.from_generator(data_generator, args=[train_file_list, 2],
                                               output_types=(tf.float32, tf.float32))  # This is where I am stuck
# My sample data and labels look like this:
# data = ['a', 'b', 1, 2, 3.14, 2]  # Mixed dtypes
# labels = [1.0]                    # float
val_dataset = tf.data.Dataset.from_generator(data_generator, args=[val_file_list, 2],
                                             output_types=(tf.float32, tf.float32))  # This is where I am stuck
# Preprocessing part:
def encode_inputs(EMBEDDING_FEATURES, INDICATOR_FEATURES):
    '''Function for encoding the features'''
    encoded_features = []
    for feature_name in EMBEDDING_FEATURES:
        # Getting the unique vocab list
        vocabulary = np.array(list(flatten(vocab_list[feature_name])))
        # Categorical column using the list created above:
        cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
            feature_name, vocabulary)
        embedding_dims = int(math.sqrt(len(vocabulary)))
        # Create an embedding from the categorical column:
        cat_emb = tf.feature_column.embedding_column(cat_col, 8)  # dimension=embedding_dims
        # Add the embedding to the list of feature columns
        encoded_features.append(cat_emb)
    for feature_name in INDICATOR_FEATURES:
        # Getting the unique vocab list
        vocabulary = list(flatten(vocab_list[feature_name]))
        # Indicator column using the list created above:
        ind_col = tf.feature_column.categorical_column_with_vocabulary_list(
            feature_name, vocabulary)
        # Create a one-hot indicator column from the categorical column:
        cat_one_hot = tf.feature_column.indicator_column(ind_col)
        # Add the indicator column to the list of feature columns
        encoded_features.append(cat_one_hot)
    # Create the input layer for the model
    feature_layer = tf.keras.layers.DenseFeatures(encoded_features)
    return feature_layer

# Opening the JSON file that contains the vocab list for string columns
f = open('./vocab_list.json')  # File that contains the unique values of each feature
vocab_list = json.load(f)
features_layer = encode_inputs(EMBEDDING_FEATURES, INDICATOR_FEATURES)
# Model part
model = tf.keras.models.Sequential([
    features_layer,
    tf.keras.layers.Dense(30, activation='relu'),
    tf.keras.layers.Dense(1)
])
m_loss = tf.keras.losses.mean_squared_error
m_optimizer = tf.keras.optimizers.SGD(lr=1e-3)
batch_size = 32
model.compile(loss=m_loss, optimizer=m_optimizer, metrics=['accuracy'])
model.fit(train_dataset, epochs=10, validation_data=val_dataset)
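The question above is stuck on output_types because a single pair of tf.float32 specs cannot describe string and numeric features at once. A minimal sketch (not from the original post, with hypothetical feature names 'city' and 'price') of a generator that yields a dict so each feature keeps its own dtype, described via output_signature:
import tensorflow as tf

def dict_generator():
    # One sample: string features stay tf.string, numeric features stay tf.float32.
    yield {'city': 'london', 'price': 3.14}, 1.0

dataset = tf.data.Dataset.from_generator(
    dict_generator,
    output_signature=(
        {'city': tf.TensorSpec(shape=(), dtype=tf.string),
         'price': tf.TensorSpec(shape=(), dtype=tf.float32)},
        tf.TensorSpec(shape=(), dtype=tf.float32),
    ),
).batch(2)
# The dict keys can then be matched up with feature columns or Keras preprocessing layers.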

Google Colab RAM issue with semi-supervised CNN model training

I'm trying to train a binary classifier by transfer learning on EfficientNet. Since I have lots of unlabeled data, I use a semi-supervised method to generate extra "pseudo-labeled" data before the model goes through each epoch.
Since Colab has limited RAM, I delete some large variables (numpy arrays, datasets, dataloaders, ...) in each loop, yet the RAM usage still increases every loop, as shown in the picture below.
Below is my training loop, which consists of three main parts: the semi-supervised step, the training loop, and the validation loop.
I'm not sure which step causes the RAM to keep increasing each epoch.
(1) semi-supervised part
for epoch in range(n_epochs):
    print(f"[ Epoch | {epoch + 1:03d}/{n_epochs:03d} ]")
    if do_semi:
        model.eval()
        dataset_0 = []
        dataset_1 = []
        for img in pseudo_loader:
            with torch.no_grad():
                logits = model(img.to(device))
                probs = softmax(logits)
            # Filter the data and construct a new dataset.
            for i in range(len(probs)):
                p = probs[i].tolist()
                idx = p.index(max(p))
                if p[idx] >= threshold:
                    if idx == 0:
                        dataset_0.append(img[i].numpy().reshape(128, 128, 3))
                    else:
                        dataset_1.append(img[i].numpy().reshape(128, 128, 3))
        # stratified sampling with labels
        len_0, len_1 = len(dataset_0), len(dataset_1)
        print('label 0: ', len_0)
        print('label 1: ', len_1)
        # since there may be a RAM memory error, restrict to 1000
        if len_0 > 1000:
            dataset_0 = random.sample(dataset_0, 1000)
        if len_1 > 1000:
            dataset_1 = random.sample(dataset_1, 1000)
        if len_0 == len_1:
            pseudo_x = np.array(dataset_0 + dataset_1)
            pseudo_y = ['0' for _ in range(len(dataset_0))] + ['1' for _ in range(len(dataset_1))]
        elif len_0 > len_1:
            dataset_0 = random.sample(dataset_0, len(dataset_1))
            pseudo_x = np.array(dataset_0 + dataset_1)
            pseudo_y = ['0' for _ in range(len(dataset_0))] + ['1' for _ in range(len(dataset_1))]
        else:
            dataset_1 = random.sample(dataset_1, len(dataset_0))
            pseudo_x = np.array(dataset_0 + dataset_1)
            pseudo_y = ['0' for _ in range(len(dataset_0))] + ['1' for _ in range(len(dataset_1))]
        if len(pseudo_x) != 0:
            new_dataset = CustomTensorDataset(pseudo_x, np.array(pseudo_y), 'pseudo')
        else:
            new_dataset = []
        # print how many pseudo-labeled data were added
        print('Total number of pseudo labeled data added: ', len(new_dataset))
        # release RAM
        dataset_0 = None
        dataset_1 = None
        pseudo_x = None
        pseudo_y = None
        del dataset_0, dataset_1, pseudo_x, pseudo_y
        gc.collect()
        # Turn off the eval mode.
        model.train()
        concat_dataset = ConcatDataset([train_set, new_dataset])
        train_loader = DataLoader(concat_dataset, batch_size=batch_size, shuffle=True)
I'm quite sure the problem happens in the semi-supervised part, since RAM usage does not increase when the semi-supervised part is not applied.
Thanks for your help!
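One generic way to find out which step actually grows (not from the original post) is to diff tracemalloc snapshots between epochs; a minimal sketch, reusing the n_epochs name from the code above:
import tracemalloc

# Compare memory snapshots between epochs to see which source lines hold
# the Python-level allocations that keep growing.
tracemalloc.start()
previous = tracemalloc.take_snapshot()

for epoch in range(n_epochs):
    # ... semi-supervised step, training loop, validation loop ...
    current = tracemalloc.take_snapshot()
    # Top 10 source lines whose allocations grew since the last epoch.
    for stat in current.compare_to(previous, 'lineno')[:10]:
        print(stat)
    previous = current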

Is num_epochs limited in tensorflow's csv file reader string_input_producer()?

I have a dummy csv file (y=-x+1)
x,y
1,0
2,-1
3,-2
I try to feed that into a linear regression model. Since I have only a few examples, I want to iterate the training about 1000 times over that file, so I set num_epochs=1000.
However, it seems that TensorFlow limits this number. It works fine if I use num_epochs=5 or 10, but beyond 33 it is capped at 33 epochs. Is that true, or am I doing something wrong?
# model = W*x+b
...
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
# reading input from csv
filename_queue = tf.train.string_input_producer(["/tmp/testinput.csv"], num_epochs=1000)
reader = tf.TextLineReader(skip_header_lines=1)
...
col_x, col_label = tf.decode_csv(csv_row, record_defaults=record_defaults)
with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    while True:
        try:
            input_x, input_y = sess.run([col_x, col_label])
            sess.run(train, feed_dict={x: input_x, y: input_y})
            ...
Side question: do I really need the two-step
input_x, input_y = sess.run([col_x, col_label])
sess.run(train, feed_dict={x: input_x, y: input_y})
I have tried sess.run(train, feed_dict={x: col_x, y: col_y}) directly to avoid the friction, but it doesn't work (they are graph nodes, whereas feed_dict expects regular data).
The following snippet works perfectly (with your input):
import tensorflow as tf

filename_queue = tf.train.string_input_producer(["/tmp/input.csv"], num_epochs=1000)
reader = tf.TextLineReader(skip_header_lines=1)
_, csv_row = reader.read(filename_queue)
col_x, col_label = tf.decode_csv(csv_row, record_defaults=[[0], [0]])
with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    num = 0
    try:
        while True:
            sess.run([col_x, col_label])
            num += 1
    except:
        print(num)
Which gives the following output:
edb#lapelidb:/tmp$ python csv.py
3000
That is 3 data rows * 1000 epochs = 3000 reads, so num_epochs is not being capped at 33.

Is there an easy way to estimate the size of a JSON object?

I'm working on a security service that will return a list of permissions and I'm trying to estimate the size of the json response object. Here's a piece of sample data:
ID=123
VariableName=CanAccessSomeContent
I'm looking for an easy way to estimate what the size of the JSON response object will be with 1500 rows. Is there an online estimation tool or some other technique I can use to easily get a rough size estimate?
Using Python you can estimate the size by creating the dictionary yourself, or just generate one like this...
import json
import os
import sys

dict = {}
for a in range(0, 1500):
    dict[a] = {'VariableName': 'CanAccessSomeContent'}
output = json.dumps(dict, indent=4)
print("Estimated size: " + str(sys.getsizeof(output) / 1024) + "KB")
with open("test.json", 'wb') as outfile:
    outfile.write(output)
print("Actual size: " + str(os.path.getsize('test.json') / 1024) + "KB")
Output:
Estimated size: 100KB
Actual size: 99KB
I solved it, when I needed to, by adding a File-like object that just counted the characters and json.dump()ing into it:
# File-like object that throws away everything you write to it but keeps track of the size.
class MeterFile:
    def __init__(self, size=0):
        self.size = size

    def write(self, string):
        self.size += len(string)

# Calculates the JSON-encoded size of an object without storing it.
def json_size(obj, *args, **kwargs):
    mf = MeterFile()
    json.dump(obj, mf, *args, **kwargs)
    return mf.size
The advantage is that the encoding is never stored in memory, which could matter in exactly the cases where you care about the size to begin with.
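A hypothetical usage of json_size, reusing the sample fields from the question:
# Hypothetical usage, reusing the ID / VariableName fields from the question.
rows = {i: {'ID': i, 'VariableName': 'CanAccessSomeContent'} for i in range(1500)}
print(json_size(rows))            # compact encoding, size in bytes
print(json_size(rows, indent=4))  # pretty-printed encoding, size in bytes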
Function to estimate the file size (a mash of the JSON-Size and UTF-8 Length node repos):
function json_filesize (value) {
  // returns object size in bytes
  return ~-encodeURI(JSON.stringify(value)).split(/%..|./).length
}
json_filesize({foo: 'bar'}) >> 13
I'm not sure if this is what you're after, as this seems extremely basic, but here goes:
First start with 0 rows, encode it and measure the size (we'll call this A).
Then get a decent sample of rows from your database and encode one row at a time.
For each of those outputs, calculate the size, and store the average (we'll call this B).
Now for X rows, the estimated JSON response size will be X * (B - A) + A.
So if A is 100 bytes and B is 150 bytes, for 1500 rows we get:
1500 * (150 - 100) + 100 = 75100 bytes ≈ 73 KB
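The same estimate as a rough Python sketch (the sample rows here are placeholders, measured with json.dumps):
import json

# A rough sketch of the linear estimate above; sample_rows are placeholder dicts.
def estimate_json_size(sample_rows, n_rows):
    a = len(json.dumps([]))  # A: size of an empty (0-row) response
    # B: average size of a one-row response, taken over the sample
    b = sum(len(json.dumps([row])) for row in sample_rows) / len(sample_rows)
    return n_rows * (b - a) + a

sample = [{'ID': 123, 'VariableName': 'CanAccessSomeContent'}] * 10
print(estimate_json_size(sample, 1500))  # estimated bytes for 1500 rows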

pandas find constant variables in a huge csv file

I have a large CSV file that I cannot load into memory. I need to find which variables (columns) are constant. How can I do that?
I am reading the csv as
d = pd.read_csv(load_path, header=None, chunksize=10)
Is there an elegant way to solve the problem?
The data contains string and numerical variables.
This is my current slow solution, which does not use pandas:
constant_variables = [True for i in range(number_of_columns)]
with open(load_path) as f:
    line0 = next(f).split(',')
    for num, line in enumerate(f):
        line = line.split(',')
        for i in range(number_of_columns):
            if line[i] != line0[i]:
                constant_variables[i] = False
        if num % 10000 == 0:
            print(num)
There are two methods I can think of. Either iterate over each column and check it for uniqueness:
col_list = pd.read_csv(path, nrows=1).columns
for col in range(len(col_list)):
    df = pd.read_csv(path, usecols=[col])
    if len(df.drop_duplicates()) == 1:
        print("all values are constant for: ", df.columns[0])
or iterate over the CSV in chunks and check the lengths again:
for df in pd.read_csv(path, chunksize=1000):
    t = dict(zip(df, [len(df[col].value_counts()) for col in df]))
    print(t)
The latter will read in chunks and tell you how unique each column's data is. This is just rough code which you can modify for your needs; note that it reports per-chunk uniqueness, so the results still have to be combined across chunks (see the sketch below).
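A sketch of a chunked version that carries the result across chunks (assuming header-less data, as in the question), so a column is reported constant only if it never changes over the whole file:
import pandas as pd

# A column stays "constant" only if every chunk contains a single unique value
# and that value equals the one in the file's very first row.
def find_constant_columns(path, chunksize=100000):
    first_row = None
    constant = None
    for chunk in pd.read_csv(path, header=None, chunksize=chunksize):
        if first_row is None:
            first_row = chunk.iloc[0]
            constant = pd.Series(True, index=chunk.columns)
        constant &= (chunk.nunique() == 1) & chunk.iloc[0].eq(first_row)
    return list(constant[constant].index)

# Usage: positional indices of the columns whose value never changes.
# print(find_constant_columns(load_path))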