How to write and read ndarray to pyarrow plasma store? - pyarrow

I'm trying to write an ndarray into pyarrow plasma store and read it in another process.
I've googled but I wasn't able find a similar snipped code.
The ndrarray is an RGB video frame.
To write to the store I do:
buffer_id = plasma.ObjectID.from_random()
tensor = pa.Tensor.from_numpy(frame)
data_size = pa.get_tensor_size(tensor)
buf = plasma_client.create(buffer_id, data_size)
array = np.frombuffer(buf).reshape(frame.shape)
array[:] = frame
plasma_client.seal(buffer_id)
and to read I do:
for object_id in plasma_client.list():
[buff] = plasma_client.get_buffers([object_id])
reader = pa.BufferReader(buff)
tensor2 = pa.read_tensor(reader)
frame = tensor2.to_numpy()
process_frame(frame)
and I'm getting the following error:
Traceback (most recent call last):
File "writer.py", line 68, in write_to_plasma
array = np.frombuffer(buf).reshape(frame.shape)
ValueError: cannot reshape array of size 1154768 into shape (1458,2112,3)
Am I missing something here for the data format, do I need to reshape the image manually?

According to the documentation, you should write the pyarrow Tensor directly into the buffer:
import pyarrow.ipc as ipc
buffer_id = plasma.ObjectID.from_random()
tensor = pa.Tensor.from_numpy(frame)
data_size = ipc.get_tensor_size(tensor)
buf = plasma_client.create(buffer_id, data_size)
stream = pa.FixedSizeBufferWriter(buf)
ipc.write_tensor(tensor, stream)
plasma_client.seal(buffer_id)
Looking at your code, this part doesn't make sense:
array = np.frombuffer(buf).reshape(frame.shape)
I don't think you can read/write the numpy array directly form the buffer given arrow adds some metada to it.
You can get the tensor back this way:
[buf2] = plasma_client.get_buffers([buffer_id])
reader = pa.BufferReader(buf2)
tensor2 = pa.ipc.read_tensor(reader)

Related

Why must use DataParallel when testing?

Train on the GPU, num_gpus is set to 1:
device_ids = list(range(num_gpus))
model = NestedUNet(opt.num_channel, 2).to(device)
model = nn.DataParallel(model, device_ids=device_ids)
Test on the CPU:
model = NestedUNet_Purn2(opt.num_channel, 2).to(dev)
device_ids = list(range(num_gpus))
model = torch.nn.DataParallel(model, device_ids=device_ids)
model_old = torch.load(path, map_location=dev)
pretrained_dict = model_old.state_dict()
model_dict = model.state_dict()
pretrained_dict = {k: v for k, v in pretrained_dict.items() if k in model_dict}
model_dict.update(pretrained_dict)
model.load_state_dict(model_dict)
This will get the correct result, but when I delete:
device_ids = list(range(num_gpus))
model = torch.nn.DataParallel(model, device_ids=device_ids)
the result is wrong.
nn.DataParallel wraps the model, where the actual model is assigned to the module attribute. That also means that the keys in the state dict have a module. prefix.
Let's look at a very simplified version with just one convolution to see the difference:
class NestedUNet(nn.Module):
def __init__(self):
super().__init__()
self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
model = NestedUNet()
model.state_dict().keys() # => odict_keys(['conv1.weight', 'conv1.bias'])
# Wrap the model in DataParallel
model_dp = nn.DataParallel(model, device_ids=range(num_gpus))
model_dp.state_dict().keys() # => odict_keys(['module.conv1.weight', 'module.conv1.bias'])
The state dict you saved with nn.DataParallel does not line up with the regular model's state. You are merging the current state dict with the loaded state dict, that means that the loaded state is ignored, because the model does not have any attributes that belong to the keys and instead you are left with the randomly initialised model.
To avoid making that mistake, you shouldn't merge the state dicts, but rather directly apply it to the model, in which case there will be an error if the keys don't match.
RuntimeError: Error(s) in loading state_dict for NestedUNet:
Missing key(s) in state_dict: "conv1.weight", "conv1.bias".
Unexpected key(s) in state_dict: "module.conv1.weight", "module.conv1.bias".
To make the state dict that you have saved compatible, you can strip off the module. prefix:
pretrained_dict = {key.replace("module.", ""): value for key, value in pretrained_dict.items()}
model.load_state_dict(pretrained_dict)
You can also avoid this issue in the future by unwrapping the model from nn.DataParallel before saving its state, i.e. saving model.module.state_dict(). So you can always load the model first with its state and then later decide to put it into nn.DataParallel if you wanted to use multiple GPUs.
You trained your model using DataParallel and saved it. So, the model weights were stored with a module. prefix. Now, when you load without DataParallel, you basically are not loading any model weights (the model has random weights). As a result, the model predictions are wrong.
I am giving an example.
model = nn.Linear(2, 4)
model = torch.nn.DataParallel(model, device_ids=device_ids)
model.state_dict().keys() # => odict_keys(['module.weight', 'module.bias'])
On the other hand,
another_model = nn.Linear(2, 4)
another_model.state_dict().keys() # => odict_keys(['weight', 'bias'])
See the difference in the OrderedDict keys.
So, in your code, the following three-line works but no model weights are loaded.
pretrained_dict = model_old.state_dict()
model_dict = model.state_dict()
pretrained_dict = {k: v for k, v in pretrained_dict.items() if k in model_dict}
Here, model_dict has keys without the module. prefix but pretrained_dict has when you do not use DataParalle. So, essentially pretrained_dict is empty when DataParallel is not used.
Solution: If you want to avoid using DataParallel, or you can load the weights file, create a new OrderedDict without the module prefix, and load it back.
Something like the following would work for your case without using DataParallel.
# original saved file with DataParallel
model_old = torch.load(path, map_location=dev)
# create new OrderedDict that does not contain `module.`
from collections import OrderedDict
new_state_dict = OrderedDict()
for k, v in model_old.items():
name = k[7:] # remove `module.`
new_state_dict[name] = v
# load params
model.load_state_dict(new_state_dict)

Trying to get CSV ready for keras model with tensorflow dataset

I do have a keras CNN model ready which expects [None,20,20,3] arrays as input. (20 is image size here...) On the other side I do have a CSV with 1200 (20*20*3) columns ready in my cloud storage.
I want to write an ETL pipeline with tensorflow to obtain a [20,20,3] shape tensor for each row in the csv.
My code so far:
I spent days of work already and feel confident, that this approoach might work out in the end.
import tensorflow as tf
BATCH_SIZE = 30
tf.enable_eager_execution()
X_csv_path = 'gs://my-bucket/dataX.csv'
X_dataset = tf.data.experimental.make_csv_dataset(X_csv_path, BATCH_SIZE, column_names=range(1200) , header=False)
X_dataset = X_dataset.map(lambda x: tf.stack(list(x.values())))
iterator = X_dataset.make_one_shot_iterator()
image = iterator.get_next()
I would expect to have a [30,1200] shape but I still get 1200 tensors of shape [30] instead. My idea is to read every line into a [1200] shaped tensor and then reshape the line to a [20,20,3] tensor to feed my model with. Thanks for your time!
tf.data.experimental.make_csv_dataset creates a OrderedDict of column arrays. For your task I'd use tf.data.TextLineDataset.
def parse(filename):
string = tf.strings.split([filename], sep=',').values
return string
dataset = tf.data.TextLineDataset('sample.csv').map(parse).batch(BATCH_SIZE)
for i in dataset:
print(i)
This will output tensor of shape (BATCH_SIZE, row_length), where row_length is a row from csv file. You can apply any additional preprocessing, depending on your task

Is it possible to access a solver propertiese through Pycaffe?

Is it possible to read and access the solver properties in Pycaffe?
I need to use some of the information stored in the solver file, but apparently the solver object which is created using
import caffe
solver = caffe.get_solver(solver_path)
is of no use in this case. Is there any other way to get around this problem?
I couldn't find what I was after using solver object and I ended up writing a quick function to get around this issue:
def retrieve_field(solver_path, field=None):
'''
Returns a specific solver parameter value using the specified field
or the whole content of the solver file, when no field is provided.
returns:
a string, as a field value or the whole content as list
'''
lines = []
field_segments = []
with open(solver_path, 'r') as file:
for line in file:
line = line.strip()
lines.append(line)
field_segments = line.split(':')
if (field_segments[0] == field):
#if that line contains # marks (for comments)
if('#' in field_segments[-1]):
idx = field_segments[-1].index('#')
return field_segments[-1][0:idx]
else:
return field_segments[-1]
return lines

Variable length output in keras

I'm trying to create an autoencoder in keras with bucketing where the input and the output have different time steps.
model = Sequential()
#encoder
model.add(Embedding(vocab_size, embedding_size, mask_zero=True))
model.add(LSTM(units=hidden_size, return_sequences=False))
#decoder
model.add(RepeatVector(max_out_length))
model.add(LSTM(units=hidden_size, return_sequences=True))
model.add(TimeDistributed(Dense(num_class, activation='softmax')))
For the input there is no problem as the network can accept different length inputs as long as the whole batch has the same length. However the problem is with the output size as its determined by the RepeatVector length and there is not easy way to change it.
Is there a solution for such a problem?
If you mean "inputs with variable lengths" and "outputs with the same lengths as the inputs", you can do this:
Warning: this solution must work with batch size = 1
You will need to create an external loop and pass each sample as a numpy array with the exact length
You cannot use masking in this solution, and the right output depends on the correct length of the input
This is a working code using Keras + Tensorflow:
Imports:
from keras.layers import *
from keras.models import Model
import numpy as np
import keras.backend as K
from keras.utils.np_utils import to_categorical
Custom functions to use in Lambda layers:
#this function gets the length from the original input
#and stores it in the final output of the encoder
def storeLength(x):
inputTensor = x[0]
storeInto = x[1] #the final output
length = K.shape(inputTensor)[1]
length = K.cast(length,K.floatx())
length = K.reshape(length,(1,1))
#will put length as the first element in the final output
return K.concatenate([length,storeInto])
#this function expands the length of the input in the decoder
def expandLength(x):
#lenght is the first element in the encoded input
length = K.cast(x[0,0],'int32') #or int64 if necessary
#the remaining elements are the actual data to be decoded
data = x[:,1:]
#a tensor with shape (length,)
length = K.ones_like(K.arange(0,length))
#make both length tensor and data tensor 3D and with paired dimensions
length = K.cast(K.reshape(length,(1,-1,1)),K.floatx())
data = K.reshape(data,(1,1,-1))
#this automatically repeats the elements based on the paired shapes
return data*length
Creating the models:
I assumed the output is equal to the input, but since you're using an Embedding, I made "num_classes" equal to the number of words.
For this solution, we use a branching, thus I had to use the functional API Model. Which will be way better later, because you will want to train with autoencoder.train_on_batch and then just encode with encoder.predict() or just decode with decoder.predict().
vocab_size = 100
embedding_size = 7
num_class=vocab_size
hidden_size = 3
#encoder
inputs = Input(batch_shape = (1,None))
outputs = Embedding(vocab_size, embedding_size)(inputs)
outputs = LSTM(units=hidden_size, return_sequences=False)(outputs)
outputs = Lambda(storeLength)([inputs,outputs])
encoder = Model(inputs,outputs)
#decoder
inputs = Input(batch_shape=(1,hidden_size+1))
outputs = Lambda(expandLength)(inputs)
outputs = LSTM(units=hidden_size, return_sequences=True)(outputs)
outputs = TimeDistributed(Dense(num_class, activation='softmax'))(outputs)
decoder = Model(inputs,outputs)
#autoencoder
inputs = Input(batch_shape=(1,None))
outputs = encoder(inputs)
outputs = decoder(outputs)
autoencoder = Model(inputs,outputs)
#see each model's shapes
encoder.summary()
decoder.summary()
autoencoder.summary()
Just an example with fake data and the method that should be used for training:
inputData = []
outputData = []
for i in range(7,10):
inp = np.arange(i).reshape((1,i))
inputData.append(inp)
outputData.append(to_categorical(inp,num_class))
autoencoder.compile(loss='mse',optimizer='adam')
for epoch in range(1):
for inputSample,outputSample in zip(inputData,outputData):
print(inputSample.shape,outputSample.shape)
autoencoder.train_on_batch(inputSample,outputSample)
for inputSample in inputData:
print(autoencoder.predict(inputSample).shape)

how to chunk a csv (dict)reader object in python 3.2?

I try to use Pool from the multiprocessing module to speed up reading in large csv files. For this, I adapted an example (from py2k), but it seems like the csv.dictreader object has no length. Does it mean I can only iterate over it? Is there a way to chunk it still?
These questions seemed relevant, but did not really answer my question:
Number of lines in csv.DictReader,
How to chunk a list in Python 3?
My code tried to do this:
source = open('/scratch/data.txt','r')
def csv2nodes(r):
strptime = time.strptime
mktime = time.mktime
l = []
ppl = set()
for row in r:
cell = int(row['cell'])
id = int(row['seq_ei'])
st = mktime(strptime(row['dat_deb_occupation'],'%d/%m/%Y'))
ed = mktime(strptime(row['dat_fin_occupation'],'%d/%m/%Y'))
# collect list
l.append([(id,cell,{1:st,2: ed})])
# collect separate sets
ppl.add(id)
return (l,ppl)
def csv2graph(source):
r = csv.DictReader(source,delimiter=',')
MG=nx.MultiGraph()
l = []
ppl = set()
# Remember that I use integers for edge attributes, to save space! Dic above.
# start: 1
# end: 2
p = Pool(processes=4)
node_divisor = len(p._pool)*4
node_chunks = list(chunks(r,int(len(r)/int(node_divisor))))
num_chunks = len(node_chunks)
pedgelists = p.map(csv2nodes,
zip(node_chunks))
ll = []
for l in pedgelists:
ll.append(l[0])
ppl.update(l[1])
MG.add_edges_from(ll)
return (MG,ppl)
From the csv.DictReader documentation (and the csv.reader class it subclasses), the class returns an iterator. The code should have thrown a TypeError when you called len().
You can still chunk the data, but you'll have to read it entirely into memory. If you're concerned about memory you can switch from csv.DictReader to csv.reader and skip the overhead of the dictionaries csv.DictReader creates. To improve readability in csv2nodes(), you can assign constants to address each field's index:
CELL = 0
SEQ_EI = 1
DAT_DEB_OCCUPATION = 4
DAT_FIN_OCCUPATION = 5
I also recommend using a different variable than id, since that's a built-in function name.