How to save and load a model trained on a custom dataset in Detectron2?

I have tried to save and load the model as follows (attempt #1). All keys are mapped, but there is no prediction in the output:
import torch
from detectron2.modeling import build_model
model = build_model(cfg)
torch.save(model.state_dict(), 'checkpoint.pth')
model.load_state_dict(torch.load(checkpoint_path, map_location='cpu'))
I also tried the approach from the official docs, but I can't understand the input format part:
from detectron2.checkpoint import DetectionCheckpointer
DetectionCheckpointer(model).load(file_path_or_url) # load a file, usually from cfg.MODEL.WEIGHTS
checkpointer = DetectionCheckpointer(model, save_dir="output")
checkpointer.save("model_999") # save to output/model_999.pth

My code to load the custom model works:
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor
from detectron2 import model_zoo

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file('COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml'))
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # set threshold for this model
cfg.MODEL.WEIGHTS = '/content/model_final.pth'  # path to the trained .pth weights
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1
predictor = DefaultPredictor(cfg)
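On the input format the question mentions: when the model built with build_model is called directly (instead of going through DefaultPredictor), Detectron2 expects a list of dicts in inference mode. A minimal sketch, assuming cfg is configured as above and a hypothetical test image sample.jpg read with OpenCV in BGR order:
import cv2
import torch
from detectron2.modeling import build_model
from detectron2.checkpoint import DetectionCheckpointer

model = build_model(cfg)                              # cfg as configured above
DetectionCheckpointer(model).load('/content/model_final.pth')
model.eval()

img = cv2.imread('sample.jpg')                        # hypothetical test image, HxWxC, BGR
inputs = [{
    "image": torch.as_tensor(img.astype("float32").transpose(2, 0, 1)),  # CxHxW tensor
    "height": img.shape[0],                           # desired output resolution
    "width": img.shape[1],
}]
with torch.no_grad():
    outputs = model(inputs)                           # list with one dict containing "instances"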

Related

Loading multiple csvs with mixed dtypes in tensorflow for training

I have hundreds of CSVs in a directory, all with headers. I am trying to build a feedforward NN for regression using TensorFlow.
What's the best way to import these CSVs and train on them with tf?
Could you also check whether my preprocessing is done correctly?
Note: my features have mixed datatypes (int, float, string); my target is a float.
I cannot concatenate the CSVs and import them with pandas: the data is >50 GB, so it cannot be loaded in memory and has to be read iteratively from disk.
Directory Path:
./data/train/ -> 100s of csvs
./data/test -> 100s of csvs
./data/valid -> 100s of csvs
Methodology:
Create a generator
Use the Dataset API to load the data
Preprocess the data (embedding, one-hot, etc.)
Train with model.fit
But in the generator I was only able to specify output formats where the inputs/outputs have homogeneous dtypes.
Code:
import glob
import numpy as np
import pandas as pd
import tensorflow as tf

def data_generator(file_list, batch_size=2):
    i = 0
    while True:
        if i * batch_size >= len(file_list):  # This check is used to run the generator indefinitely.
            i = 0
            np.random.shuffle(file_list)
        else:
            file_chunk = file_list[i * batch_size:(i + 1) * batch_size]
            data = []
            labels = []
            for file in file_chunk:
                temp = pd.read_csv(open(file, 'r'))  # Change this line to read any other type of file
                label = temp.pop('ACTUAL_BOXES')     # pop the target column so it doesn't shadow the labels list
                data.append(temp.values)             # Convert column data to matrix-like data with one channel
                labels.append(label.values)
            data = np.asarray(data)
            labels = np.asarray(labels)
            yield data, labels  # Here data will be mixed-dtype arrays & labels will be a float-dtype array
            i = i + 1

# getting the list of files inside the directories
train_file_list = np.sort(glob.glob('././data/train/*.csv'))
test_file_list = np.sort(glob.glob('././data/test/*.csv'))
val_file_list = np.sort(glob.glob('././data/val/*.csv'))

train_dataset = tf.data.Dataset.from_generator(data_generator, args=[train_file_list, 2],
    output_types=(tf.float32, tf.float32),  # This is where I am stuck
    # my sample data and labels will look like this:
    # data = ['a', 'b', 1, 2, 3.14, 2]  # mixed dtypes
    # labels = [1.0]  # float
)
val_dataset = tf.data.Dataset.from_generator(data_generator, args=[val_file_list, 2],
    output_types=(tf.float32, tf.float32),  # This is where I am stuck
)
# Preprocessing part:
def encode_inputs(EMBEDDING_FEATURES, INDICATOR_FEATURES):
    '''Function for encoding the features'''
    encoded_features = []
    for feature_name in EMBEDDING_FEATURES:
        # get the unique vocab list
        vocabulary = np.array(list(flatten(vocab_list[feature_name])))
        # categorical column using the list created above:
        cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
            feature_name, vocabulary)
        embedding_dims = int(math.sqrt(len(vocabulary)))
        # create an embedding from the categorical column:
        cat_emb = tf.feature_column.embedding_column(cat_col, 8)  # , dimension=embedding_dims
        # add the embedding to the list of feature columns
        encoded_features.append(cat_emb)
    for feature_name in INDICATOR_FEATURES:
        # get the unique vocab list
        vocabulary = list(flatten(vocab_list[feature_name]))
        # indicator column using the list created above:
        ind_col = tf.feature_column.categorical_column_with_vocabulary_list(
            feature_name, vocabulary)
        # create a one-hot indicator column from the categorical column:
        cat_one_hot = tf.feature_column.indicator_column(ind_col)
        # add the one-hot column to the list of feature columns
        encoded_features.append(cat_one_hot)
    # create the input layer for the model
    feature_layer = tf.keras.layers.DenseFeatures(encoded_features)
    return feature_layer

# opening the JSON file that contains the vocab list for the string columns
f = open('./vocab_list.json')  # file that contains the unique values of each feature
vocab_list = json.load(f)
features_layer = encode_inputs(EMBEDDING_FEATURES, INDICATOR_FEATURES)
# Model part
model = tf.keras.models.Sequential([
    features_layer,
    tf.keras.layers.Dense(30, activation='relu'),
    tf.keras.layers.Dense(1)
])
m_loss = tf.keras.losses.mean_squared_error
m_optimizer = tf.keras.optimizers.SGD(lr=1e-3)
batch_size = 32
model.compile(loss=m_loss, optimizer=m_optimizer, metrics=['accuracy'])
model.fit(train_dataset, epochs=10, validation_data=val_dataset)
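Since the generator yields mixed dtypes, one option (a sketch of an alternative, not the asker's code) is to let tf.data parse the CSVs directly and yield a dict of per-column tensors, which is what a DenseFeatures layer built from feature columns expects as input. The label column name ACTUAL_BOXES and the directory layout are taken from the question; everything else is assumed.
import tensorflow as tf

# a minimal sketch: let tf.data infer per-column dtypes from the CSV headers
# and stream the files from disk without loading everything into memory
train_dataset = tf.data.experimental.make_csv_dataset(
    './data/train/*.csv',          # file pattern from the question's directory layout
    batch_size=32,
    label_name='ACTUAL_BOXES',     # target column named in the generator above
    num_epochs=1,
    shuffle=True,
)
val_dataset = tf.data.experimental.make_csv_dataset(
    './data/valid/*.csv',
    batch_size=32,
    label_name='ACTUAL_BOXES',
    num_epochs=1,
    shuffle=False,
)

# each element is (dict_of_feature_tensors, label_tensor), so mixed dtypes
# are preserved per column instead of being forced into one array
for features, label in train_dataset.take(1):
    print({k: v.dtype for k, v in features.items()}, label.dtype)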

How can I extract information quickly from 130,000+ JSON files located in S3?

I have an S3 bucket with over 130k JSON files, and I need to compute numbers based on the data in those files (for example, counting the number of speakers of each gender). I am currently using the S3 paginator and json.loads to read each file and extract the information, but it takes a very long time to process such a large number of files (2-3 files per second). How can I speed up the process? Please provide working code examples if possible. Thank you.
Here is some of my code:
import json
import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
result = paginator.paginate(Bucket='bucket-name', StartAfter='')
for page in result:
    if "Contents" in page:
        for key in page["Contents"]:
            keyString = key["Key"]
            s3 = boto3.resource('s3')
            content_object = s3.Bucket('bucket-name').Object(str(keyString))
            file_content = content_object.get()['Body'].read().decode('utf-8')
            json_content = json.loads(file_content)
            x = json_content['dict-name']
In order to use the code below, I'm assuming you understand pandas (if not, you may want to get to know it). Also, it's not clear whether your 2-3 files per second covers only the read or also includes part of the number crunching; nonetheless, multiprocessing will speed this up dramatically. The gist is to read all the files in (as dataframes), concatenate them, and then do your analysis.
To make this useful for me, I run it on spot instances that have lots of vCPUs and memory. I've found that network-optimized instances (like c5n - look for the n) and inf1 instances (for machine learning) are much faster at reading/writing than T or M instance types, as examples.
My use case is reading 2000 'directories' with roughly 1200 files in each and analyzing them. The multiprocessing is orders of magnitude faster than a single process.
File 1: your main script
# create script.py file
import os
from multiprocessing import Pool
from itertools import repeat
import pandas as pd
import json

from utils_file_handling import *
ufh = file_utilities()  # instantiate the class functions - see below (second file)

bucket = 'your-bucket'
prefix = 'your-prefix/here/'  # if you don't have a prefix pass '' (empty string, or the function will fail)

# define the multiprocessing function - get to know this to use multiple processors to read files simultaneously
def get_dflist_multiprocess(keys_list, num_proc=4):
    with Pool(num_proc) as pool:
        df_list = pool.starmap(ufh.reader_json, zip(repeat(bucket), keys_list), 15)
        pool.close()
        pool.join()
    return df_list

# create your master keys list upfront; you can loop through all or slice the list to test
keys_list = ufh.get_keys_from_prefix(bucket, prefix)
# keys_list = keys_list[0:2000]  # as an example
num_proc = os.cpu_count()  # tells you how many processors your machine has; the function above defaults to 4 unless given
df_list = get_dflist_multiprocess(keys_list, num_proc=num_proc)  # collect a dataframe for each file
df_new = pd.concat(df_list, sort=False)
df_new = df_new.reset_index(drop=True)
# do your analysis on the dataframe
File 2: class functions
# utils_file_handling.py
# create this in a separate file; name it as you wish but change the import in the script.py file
import boto3
import json
import pandas as pd

# define client and resource
s3sr = boto3.resource('s3')
s3sc = boto3.client('s3')

class file_utilities:
    """file handling functions"""

    def get_keys_from_prefix(self, bucket, prefix):
        '''gets the list of keys for the given bucket and prefix'''
        keys_list = []
        paginator = s3sr.meta.client.get_paginator('list_objects_v2')
        # use Delimiter to limit the search to that level of the hierarchy
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
            keys = [content['Key'] for content in page.get('Contents')]
            print('keys in page: ', len(keys))
            keys_list.extend(keys)
        return keys_list

    def read_json_file_from_s3(self, bucket, key):
        """read a json file from s3 and return its decoded contents"""
        obj = s3sc.get_object(Bucket=bucket, Key=key)
        data = obj['Body'].read().decode('utf-8')
        return data

    # you may need to tweak this for your ['dict-name'] example; I think I have it correct
    def reader_json(self, bucket, key):
        '''returns a dataframe'''
        return pd.DataFrame(json.loads(self.read_json_file_from_s3(bucket, key))['dict-name'])
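As a usage note, once df_new holds all the records, the kind of counting the question asks about is a one-liner in pandas. The column name gender below is hypothetical and should match whatever key your JSON files actually use:
# hypothetical analysis step: count speakers per gender across all 130k+ files
gender_counts = df_new['gender'].value_counts()
print(gender_counts)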

How to handle large JSON file in Pytorch?

I am working on a time series problem. Different training time series are stored in a large JSON file that is 30 GB in size. In TensorFlow I know how to use TFRecords. Is there a similar way in PyTorch?
I suppose IterableDataset (docs) is what you need, because:
you probably want to traverse the files without random access;
the number of samples in the JSONs is not pre-computed.
I've made a minimal usage example under the assumption that every line of the dataset file is a JSON object itself, but you can change the logic.
import json
from torch.utils.data import DataLoader, IterableDataset

class JsonDataset(IterableDataset):
    def __init__(self, files):
        self.files = files

    def __iter__(self):
        for json_file in self.files:
            with open(json_file) as f:
                for sample_line in f:
                    sample = json.loads(sample_line)
                    yield sample['x'], sample['time'], ...

...

dataset = JsonDataset(['data/1.json', 'data/2.json', ...])
dataloader = DataLoader(dataset, batch_size=32)

for batch in dataloader:
    y = model(batch)
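One caveat worth adding to the IterableDataset approach above (not part of the original answer): if the DataLoader is given num_workers > 0, every worker iterates over the full file list and samples get duplicated. A common fix is to shard the files across workers with get_worker_info(); a rough sketch:
import json
from torch.utils.data import IterableDataset, get_worker_info

class ShardedJsonDataset(IterableDataset):
    def __init__(self, files):
        self.files = files

    def __iter__(self):
        worker = get_worker_info()
        files = self.files
        if worker is not None:
            # each worker only reads every num_workers-th file, starting at its own id
            files = files[worker.id::worker.num_workers]
        for json_file in files:
            with open(json_file) as f:
                for line in f:
                    sample = json.loads(line)
                    yield sample  # adapt to the fields your model needs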
Generally, you do not need to change/overload the default data.DataLoader.
What you should look into is how to create a custom data.Dataset.
Once you have your own Dataset that knows how to extract items one by one from the JSON files, you feed it to the "vanilla" data.DataLoader and all the batching/multiprocessing etc. is done for you based on the dataset you provide.
If, for example, you have a folder with several json files, each containing several examples, you can have a Dataset that looks like:
import bisect
import glob
import json
import os
from torch.utils import data

class MyJsonsDataset(data.Dataset):
    def __init__(self, jfolder):
        super(MyJsonsDataset, self).__init__()
        self.filenames = []           # keep track of the json files you need to load
        self.cumulative_sizes = [0]   # keep track of the number of examples seen so far
        # this assumes each json file holds a list of examples - adapt to your actual format
        for jsonfile in sorted(glob.glob(os.path.join(jfolder, '*.json'))):
            self.filenames.append(jsonfile)
            with open(jsonfile) as f:
                l = len(json.load(f))  # number of examples in this json file
            self.cumulative_sizes.append(self.cumulative_sizes[-1] + l)
        # discard the leading zero
        self.cumulative_sizes.pop(0)

    def __len__(self):
        return self.cumulative_sizes[-1]

    def __getitem__(self, idx):
        # first you need to know which of the files holds the idx-th example
        jfile_idx = bisect.bisect_right(self.cumulative_sizes, idx)
        if jfile_idx == 0:
            sample_idx = idx
        else:
            sample_idx = idx - self.cumulative_sizes[jfile_idx - 1]
        # now retrieve the `sample_idx`-th example from self.filenames[jfile_idx]
        with open(self.filenames[jfile_idx]) as f:
            retrieved_example = json.load(f)[sample_idx]
        return retrieved_example
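A possible usage sketch for the class above, assuming a hypothetical folder data/jsons and a model defined elsewhere; depending on what each example looks like, a custom collate_fn may also be needed:
from torch.utils.data import DataLoader

dataset = MyJsonsDataset('data/jsons')      # hypothetical folder of json files
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

for batch in dataloader:
    output = model(batch)                   # model is assumed to be defined elsewhere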

How to load a trained MXnet model?

I have trained a network using MXnet, but am not sure how I can save and load the parameters for later use. First I define and train the network:
dataIn = mx.sym.var('data')
fc1 = mx.symbol.FullyConnected(data=dataIn, num_hidden=100)
act1 = mx.sym.Activation(data=fc1, act_type="relu")
fc2 = mx.symbol.FullyConnected(data=act1, num_hidden=50)
act2 = mx.sym.Activation(data=fc2, act_type="relu")
fc3 = mx.symbol.FullyConnected(data=act2, num_hidden=25)
act3 = mx.sym.Activation(data=fc3, act_type="relu")
fc4 = mx.symbol.FullyConnected(data=act3, num_hidden=10)
act4 = mx.sym.Activation(data=fc4, act_type="relu")
fc5 = mx.symbol.FullyConnected(data=act4, num_hidden=2)
lenet = mx.sym.SoftmaxOutput(data=fc5, name='softmax',normalization = 'batch')
# create iterator around training and validation data
train_iter = mx.io.NDArrayIter(data=data[:ntrain], label = phen[:ntrain],batch_size=batch_size, shuffle=True)
val_iter = mx.io.NDArrayIter(data=data[ntrain:], label=phen[ntrain:], batch_size=batch_size)
# create a trainable module on GPU 0
lenet_model = mx.mod.Module(symbol=lenet, context=mx.gpu())
# train with the same
lenet_model.fit(train_iter,
                eval_data=val_iter,
                optimizer='adam',
                optimizer_params={'learning_rate': 0.00001},
                eval_metric='f1',
                batch_end_callback=mx.callback.Speedometer(batch_size, 10),
                num_epoch=1000)
This model performs well on the test set, so I want to keep it. Next, I save the network layout and the parameterization:
lenet.save('./testNet_symbol.mxnet')
lenet_model.save_params('./testNet_module.mxnet')
All the documentation I can find on loading the network seems to implement the save function within the training routine, saving the network parameters at the end of each epoch. I haven't set these checkpoints during the training process. Other methods use the mx.model.FeedForward class, which doesn't seem appropriate. Still other methods load the network from a .json file, which I don't have as a result of my save functions. How can I save/load a network after it's already finished training?
You just have to do this instead to save:
lenet_model.save_checkpoint('lenet', num_epoch, save_optimizer_states=True)
This creates 3 files if the states flag is set to True, otherwise 2 files:
.params (weights),
.json (symbol),
.states
And this to load:
lenet_model = mx.mod.Module.load(prefix,epoch)
lenet_model.bind(for_training=False, data_shapes=[('data', (1,3,224,224))])
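For completeness, a hedged usage sketch of running inference with the loaded module; test_data and batch_size are placeholders, and 1000 is just the epoch at which the checkpoint above was saved:
# load the checkpoint saved with save_checkpoint('lenet', num_epoch, ...)
lenet_model = mx.mod.Module.load('lenet', 1000)

# bind with the shapes of the data you intend to predict on (placeholder iterator)
test_iter = mx.io.NDArrayIter(data=test_data, batch_size=batch_size)
lenet_model.bind(for_training=False, data_shapes=test_iter.provide_data)

# run the forward pass over the iterator and collect the output probabilities
probs = lenet_model.predict(test_iter)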

Can't load 2 models in pycaffe

I am trying to load several models in pycaffe so I can perform ensemble learning. I can load 1 model fine and perform tests on it, but when I try to load my second model it just stops.
I use the following code:
model_def_net0 = caffe_root + 'Dataset/net0/deploy.prototxt'
model_weights_net0 = caffe_root + 'Dataset/net0/net0_train_vgg_iter_250000.caffemodel'
net0 = caffe.Net(model_def_net0,       # defines the structure of the model
                 model_weights_net0,   # contains the trained weights
                 caffe.TEST)           # use test mode (e.g., don't perform dropout)

model_def_caffe = caffe_root + 'Dataset/caffenet/deploy_3.prototxt'
model_weights_caffe = caffe_root + 'Dataset/caffenet/caffenet_train_iter_150000.caffemodel'
net1 = caffe.Net(model_def_caffe,      # defines the structure of the model
                 model_weights_caffe,  # contains the trained weights
                 caffe.TEST)           # use test mode (e.g., don't perform dropout)
When I execute the code it will halt when trying to load net1 with the following in the log:
I0516 18:29:39.718916 7339 net.cpp:411] data -> data
I0516 18:29:39.718942 7339 net.cpp:411] data -> label