Why am I facing OOM in dali/pytorch inference pipeline?

I have a trained ResNet-50 model and I'm using it for inference on medical images in JPEG 2000 format. I'm using a DALI pipeline to speed up the process, but I run into an OOM error after inferring 10 batches. I delete the tensors after each batch, yet I still hit the OOM error.
Batch: 32,
image_size: 512 * 512,
GPU: P100 16GB.
@pipeline_def
def j2k_decode_pipeline():
    jpegs, labels = fn.readers.file(files=j2kfiles)
    images = fn.experimental.decoders.image(jpegs, device='mixed', output_type=types.RGB, dtype=DALIDataType.INT16)
    images = fn.resize(
        images,
        resize_x=IMAGE_SIZE,
        resize_y=IMAGE_SIZE,
        resize_z=3,
        interp_type=types.INTERP_LANCZOS3)
    return images, labels
def get_preds_jk2(max_batch_size):
    cnt = math.ceil(len(j2kfiles)/max_batch_size)-1
    pipe = j2k_decode_pipeline(batch_size=max_batch_size, num_threads=2, device_id=0, debug=True)
    pipe.build()
    dali_iter = DALIGenericIterator(pipe, ['data', 'label'])
    model_weights_path = '/model-2/model_weights.pth'
    model = models.resnet50()
    num_ftrs = model.fc.in_features
    model.fc = nn.Linear(num_ftrs, 2)
    model.load_state_dict(torch.load(model_weights_path))
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    model.eval()
    preds, labels = [], []
    for i, data in enumerate(dali_iter):
        print('batch ', i)
        labels.append(data[0]['label'])
        # Testing correctness of labels
        imgs = data[0]['data']
        im_temp = imgs - imgs.min()
        im_temp = im_temp / im_temp.max()
        im_temp = im_temp * 255
        imgs_8bit = im_temp.type(torch.uint8).float()
        imgs_tensor = imgs_8bit.permute(0,3,1,2)
        imgs_tensor = imgs_tensor.to(device)
        del imgs, im_temp, imgs_8bit
        gc.collect()
        torch.cuda.empty_cache()
        with torch.no_grad():
            outputs = model(imgs_tensor)
            output_probs = torch.nn.functional.softmax(outputs, dim=1).data.cpu().numpy()[:,1]
        preds.append(output_probs)
        del imgs_tensor, outputs, output_probs
        gc.collect()
        torch.cuda.empty_cache()
        if i == cnt:
            print(i)
            break
    return preds, labels
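For scale, here is a back-of-the-envelope estimate of one preprocessed input batch with the settings listed above (batch 32, 512 * 512), assuming float32 values and three channels. The helper name `batch_bytes` is made up for this sketch and the arithmetic uses plain Python rather than PyTorch.

```python
# Rough size of one preprocessed input batch (assumes float32, 3 channels).
BYTES_PER_FLOAT32 = 4


def batch_bytes(batch_size, height, width, channels=3, bytes_per_value=BYTES_PER_FLOAT32):
    """Return the number of bytes needed for one dense batch tensor."""
    return batch_size * channels * height * width * bytes_per_value


input_bytes = batch_bytes(32, 512, 512)
input_mib = input_bytes / 2**20
print(f"one float32 input batch: {input_mib:.0f} MiB")
```

Under these assumptions the resized batch is small compared with 16 GB, so it may be worth checking what else stays on the GPU between iterations, for example how large the images are when they are decoded on the GPU before the resize. That is something to verify, not a diagnosis.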

Related

My neural network does not appear to learn DQN

Good morning,
I'm trying to implement a DQN based on the Bellman equation, but for some reason every prediction seems random. My code is below.
The neural network:
class simpleAgent(nn.Module):
    def __init__(self, taille_entree, taille_sortie):
        super(simpleAgent, self).__init__()
        self.fc1 = nn.Linear(taille_entree, 32)
        self.fc2 = nn.Linear(32, taille_sortie)
        self.beta = 0.999
        self.epsilon = 0.99
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
    def selectAction(self, pos):
        if random.uniform(0,1) < self.epsilon:
            self.epsilon = self.epsilon * self.beta
            action = random.randint(0,1)
            return action
        else:
            return torch.argmax(self(pos)).item()
The training code :
def optimize_model():
    global model, targetNetwork
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
    if len(buffer_circ) < BATCH_SIZE:
        return
    transitions = buffer_circ.sample(BATCH_SIZE)
    batch = Transition(*zip(*transitions))
    state = torch.stack(batch.state)
    action = torch.stack(batch.action)
    reward = torch.stack(batch.reward)
    continu = torch.stack(batch.terminated)
    next_state = torch.stack(batch.next_state)
    q_value = torch.flatten(model(state).gather(1, action))
    q_value_next = torch.max(targetNetwork(next_state), 1).values
    cible = reward + torch.mul(discount_factor,torch.mul(continu, q_value_next))
    loss = criterion(q_value, cible)
    optimizer.zero_grad()
    loss.backward(retain_graph=True)
    optimizer.step()
def train():
    global model, targetNetwork
    env = gym.make('CartPole-v1', render_mode="rgb_array", disable_env_checker=True)
    env.action_space.seed(42)
    tab = []
    for i in range(5000):
        observation, info = env.reset()
        rewards=[]
        cumule = 0
        update_target = 1
        terminated = False
        truncated = False
        while not terminated:
            old_observation = observation
            action = model.selectAction(torch.from_numpy(observation))
            observation, reward, terminated, truncated, info= env.step(action)
            cumule += reward
            reward = torch.tensor(reward)
            terminated = terminated or truncated
            state = torch.from_numpy(old_observation)
            next_state = torch.from_numpy(observation)
            if terminated:
                print(i,cumule)
                tab.append(cumule)
                rewards.append(cumule)
                break
            buffer_circ.push(state, torch.tensor([action]), next_state, reward, torch.tensor(not terminated))
            optimize_model()
            if update_target % TARGET_UPDATE == 0:
                targetNetwork.load_state_dict(model.state_dict())
                update_target = 0
            update_target += 1
    print('Complete')
    env.render()
    env.close()
I'm using a target network that is updated every TARGET_UPDATE steps; model is the network I train, and targetNetwork is a copy of it. The loss is decreasing, and I also implemented epsilon-greedy exploration in selectAction. If you could tell me what's wrong with my code, I would appreciate it.
Thank you in advance
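For reference, the target that optimize_model builds (`cible`) can be written out for a single transition. This is a minimal plain-Python sketch of the one-step target, not the batched PyTorch code from the post; the function name `td_target` and the numbers are hypothetical.

```python
def td_target(reward, next_q_values, not_done, discount_factor):
    """One-step target: r + gamma * max_a' Q_target(s', a'), zeroed at episode end."""
    best_next = max(next_q_values)
    return reward + discount_factor * (1.0 if not_done else 0.0) * best_next


# Hypothetical numbers for illustration only.
mid_episode = td_target(1.0, [0.5, 2.0], not_done=True, discount_factor=0.9)
terminal = td_target(1.0, [0.5, 2.0], not_done=False, discount_factor=0.9)
print(mid_episode, terminal)
```

The `not_done` flag plays the role of `continu` in the post: when the episode has ended, the bootstrapped term is dropped and the target is just the reward.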

RuntimeError: shape '[128, -1]' is invalid for input of size 378 pytorch

I'm running a spiking neural network for data that has 21 features with a batch size of 128. I get the following error after many iterations of training (this error doesn't arise immediately!):
RuntimeError: shape '[128, -1]' is invalid for input of size 378
When I print the shapes of the tensors just before the failing call, I get the following:
Train
torch.Size([128, 21])
Test
torch.Size([128, 21])
This is my network:
class SpikingNeuralNetwork(nn.Module):
"""
Parameters in SpikingNeuralNetwork class:
1. number_inputs: Number of inputs to the SNN.
2. number_hidden: Number of hidden layers.
3. number_outputs: Number of output classes.
4. beta: Decay rate.
"""
def __init__(self, number_inputs, number_hidden, number_outputs, beta):
super().__init__()
self.number_inputs = number_inputs
self.number_hidden = number_hidden
self.number_outputs = number_outputs
self.beta = beta
# Initialize layers
self.fc1 = nn.Linear(self.number_inputs, self.number_hidden) # Applies linear transformation to all input points
self.lif1 = snn.Leaky(beta = self.beta) # Integrates weighted input over time, emitting a spike if threshold condition is met
self.fc2 = nn.Linear(self.number_hidden, self.number_outputs) # Applies linear transformation to output spikes of lif1
self.lif2 = snn.Leaky(beta = self.beta) # Another spiking neuron, integrating the weighted spikes over time
"""
Forward propagation of SNN. The code below function will only be called once the input argument x
is explicitly passed into net.
#param x: input passed into the network
#return layer of output after applying final spiking neuron
"""
def forward(self, x):
num_steps = 25
# Initialize hidden states at t = 0
mem1 = self.lif1.init_leaky()
mem2 = self.lif2.init_leaky()
# Record the final layer
spk2_rec = []
mem2_rec = []
for step in range(num_steps):
cur1 = self.fc1(x)
spk1, mem1 = self.lif1(cur1, mem1)
cur2 = self.fc2(spk1)
spk2, mem2 = self.lif2(cur2, mem2)
spk2_rec.append(spk2)
mem2_rec.append(mem2)
return torch.stack(spk2_rec, dim = 0), torch.stack(mem2_rec, dim = 0)
This is my training loop:
def training_loop(net, train_loader, test_loader, dtype, device, optimizer):
num_epochs = 1
loss_history = []
test_loss_history = []
counter = 0
# Temporal dynamics
num_steps = 25
# Outer training loop
for epoch in range(num_epochs):
iter_counter = 0
train_batch = iter(train_loader)
# Minibatch training loop
for data, targets in train_batch:
data = data.to(device)
targets = targets.to(device)
# Forward pass
net.train()
print("Train")
print(data.size())
spk_rec, mem_rec = net(data.view(batch_size, -1))
# Initialize the loss and sum over time
loss_val = torch.zeros((1), dtype = dtype, device = device)
for step in range(num_steps):
loss_val += loss_function(mem_rec[step], targets.long().flatten().to(device))
# Gradient calculation and weight update
optimizer.zero_grad()
loss_val.backward()
optimizer.step()
# Store loss history for future plotting
loss_history.append(loss_val.item())
# Test set
with torch.no_grad():
net.eval()
test_data, test_targets = next(iter(test_loader))
test_data = test_data.to(device)
test_targets = test_targets.to(device)
# Test set forward pass
print("Test")
print(test_data.size())
test_spk, test_mem = net(test_data.view(batch_size, -1))
# Test set loss
test_loss = torch.zeros((1), dtype = dtype, device = device)
for step in range(num_steps):
test_loss += loss_function(test_mem[step], test_targets.long().flatten().to(device))
test_loss_history.append(test_loss.item())
# Print train/test loss and accuracy
if counter % 50 == 0:
train_printer(epoch, iter_counter, counter, loss_history, data, targets, test_data, test_targets)
counter = counter + 1
iter_counter = iter_counter + 1
return loss_history, test_loss_history
The error occurs on spk_rec, mem_rec = net(data.view(batch_size, -1)).
The code was adapted from https://snntorch.readthedocs.io/en/latest/tutorials/tutorial_5.html, where it was originally used for the MNIST dataset. However, I am not working with an image dataset: my dataset has 21 features and a single target with 100 classes. Based on some other forum answers, I tried changing data.view(batch_size, -1) and test_data.view(batch_size, -1) to data.view(batch_size, 21) and test_data.view(batch_size, 21), and for now my program is running through the training loop. Does anyone have suggestions for how I can get through training with no errors?
EDIT: I now get the error RuntimeError: shape '[128, 21]' is invalid for input of size 378 from spk_rec, mem_rec = net(data.view(batch_size, -1)).
Here are my DataLoaders:
train_loader = DataLoader(dataset = train, batch_size = batch_size, shuffle = True)
test_loader = DataLoader(dataset = test, batch_size = batch_size, shuffle = True)
My batch size is 128.

I tried to run this myself to reproduce your problem, but I'm missing a few things too: the net parameters and snn.Leaky.
import torch
from torch import nn
from torch.utils.data import DataLoader
class SpikingNeuralNetwork(nn.Module):
"""
Parameters in SpikingNeuralNetwork class:
1. number_inputs: Number of inputs to the SNN.
2. number_hidden: Number of hidden layers.
3. number_outputs: Number of output classes.
4. beta: Decay rate.
"""
def __init__(self, number_inputs, number_hidden, number_outputs, beta):
super().__init__()
self.number_inputs = number_inputs
self.number_hidden = number_hidden
self.number_outputs = number_outputs
self.beta = beta
# Initialize layers
self.fc1 = nn.Linear(self.number_inputs,
self.number_hidden) # Applies linear transformation to all input points
self.lif1 = snn.Leaky(
beta=self.beta) # Integrates weighted input over time, emitting a spike if threshold condition is met
self.fc2 = nn.Linear(self.number_hidden,
self.number_outputs) # Applies linear transformation to output spikes of lif1
self.lif2 = snn.Leaky(beta=self.beta) # Another spiking neuron, integrating the weighted spikes over time
"""
Forward propagation of SNN. The code below function will only be called once the input argument x
is explicitly passed into net.
#param x: input passed into the network
#return layer of output after applying final spiking neuron
"""
def forward(self, x):
num_steps = 25
# Initialize hidden states at t = 0
mem1 = self.lif1.init_leaky()
mem2 = self.lif2.init_leaky()
# Record the final layer
spk2_rec = []
mem2_rec = []
for step in range(num_steps):
cur1 = self.fc1(x)
spk1, mem1 = self.lif1(cur1, mem1)
cur2 = self.fc2(spk1)
spk2, mem2 = self.lif2(cur2, mem2)
spk2_rec.append(spk2)
mem2_rec.append(mem2)
return torch.stack(spk2_rec, dim=0), torch.stack(mem2_rec, dim=0)
batch_size = 2
train = torch.rand(128, 21)
test = torch.rand(128, 21)
train_loader = DataLoader(dataset=train, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test, batch_size=batch_size, shuffle=True)
net = SpikingNeuralNetwork(number_inputs=1)
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=0.1)
def training_loop(net, train_loader, test_loader, dtype, device, optimizer):
num_epochs = 1
loss_history = []
test_loss_history = []
counter = 0
# Temporal dynamics
num_steps = 25
# Outer training loop
for epoch in range(num_epochs):
iter_counter = 0
train_batch = iter(train_loader)
# Minibatch training loop
for data, targets in train_batch:
data = data.to(device)
targets = targets.to(device)
# Forward pass
net.train()
print("Train")
print(data.size())
spk_rec, mem_rec = net(data.view(batch_size, -1))
# Initialize the loss and sum over time
loss_val = torch.zeros((1), dtype=dtype, device=device)
for step in range(num_steps):
loss_val += loss_function(mem_rec[step], targets.long().flatten().to(device))
# Gradient calculation and weight update
optimizer.zero_grad()
loss_val.backward()
optimizer.step()
# Store loss history for future plotting
loss_history.append(loss_val.item())
# Test set
with torch.no_grad():
net.eval()
test_data, test_targets = next(iter(test_loader))
test_data = test_data.to(device)
test_targets = test_targets.to(device)
# Test set forward pass
print("Test")
print(test_data.size())
test_spk, test_mem = net(test_data.view(batch_size, -1))
# Test set loss
test_loss = torch.zeros((1), dtype=dtype, device=device)
for step in range(num_steps):
test_loss += loss_function(test_mem[step], test_targets.long().flatten().to(device))
test_loss_history.append(test_loss.item())
# Print train/test loss and accuracy
if counter % 50 == 0:
train_printer(epoch, iter_counter, counter, loss_history, data, targets, test_data, test_targets)
counter = counter + 1
iter_counter = iter_counter + 1
return loss_history, test_loss_history

Your code works just fine on the MNIST dataset, so I think the problem may lie in how the DataLoader is set up. My guess is that the size of your dataset is not evenly divisible by your batch_size, so the last batch is smaller than 128. If this is the case, you have two options:
Instead of spk_rec, mem_rec = net(data.view(batch_size, -1)), try spk_rec, mem_rec = net(data.flatten(1)), which preserves the first dimension of your data.
Alternatively, set drop_last=True in the DataLoader calls.
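The size in the error message is consistent with this guess: 378 = 18 × 21, which is what a final batch of 18 rows with 21 features would contain, and 378 cannot be reshaped to 128 rows. The sketch below shows the arithmetic in plain Python. The dataset size of 402 is hypothetical (the question does not state it) and `batch_sizes` is a helper made up for illustration, not a PyTorch function.

```python
def batch_sizes(num_samples, batch_size, drop_last=False):
    """Sizes of the batches a sequential split of num_samples would produce."""
    full, remainder = divmod(num_samples, batch_size)
    sizes = [batch_size] * full
    if remainder and not drop_last:
        sizes.append(remainder)
    return sizes


NUM_FEATURES = 21
# Hypothetical dataset size: 3 full batches of 128 plus 18 leftover rows.
sizes = batch_sizes(402, 128)
last_batch_elements = sizes[-1] * NUM_FEATURES
print(sizes, last_batch_elements)

# With drop_last=True the short final batch is simply skipped.
sizes_dropped = batch_sizes(402, 128, drop_last=True)
print(sizes_dropped)
```

Because the failing batch only comes up at the end of an epoch (or at a random point for the shuffled test loader), this would also explain why the error does not appear immediately.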

Finding the number of nodes and gpus of DistributedDataParallel

I would like to know which values I should choose for nodes and gpus.
I use Tesla V100-SXM2 GPUs (8 boards).
I tried:
nodes=1, gpus=1 (only the first GPU works)
nodes=1, gpus=8 (it took a very long time and never ran)
Did I pick the wrong values for nodes and gpus, or is my code wrong? I would appreciate your help. The code below is a simplified sample of my DDP code.
def main():
parser = argparse.ArgumentParser()
parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N')
parser.add_argument('-g', '--gpus', default=1, type=int,
help='number of gpus per node')
parser.add_argument('-nr', '--nr', default=0, type=int,
help='ranking within the nodes')
parser.add_argument('--epochs', default=200, type=int, metavar='N',
help='number of total epochs to run')
args = parser.parse_args()
args.world_size = args.gpus * args.nodes
os.environ['MASTER_ADDR'] = 'host1'
os.environ['MASTER_PORT'] = '7777'
mp.spawn(train, nprocs=args.gpus, args=(args,))
def train(gpu, args):
rank = args.nr * args.gpus + gpu
dist.init_process_group(
backend='nccl',
init_method='env://',
world_size=args.world_size,
rank=rank
)
torch.manual_seed(0)
model = ConvNet()
torch.cuda.set_device(gpu)
model.cuda(gpu)
batch_size = 100
# define loss function (criterion) and optimizer
criterion = nn.CrossEntropyLoss().cuda(gpu)
optimizer = torch.optim.SGD(model.parameters(), 1e-4)
# Wrapper around our model to handle parallel training
model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
# Data loading code
train_dataset = get_datasets()
# Sampler that takes care of the distribution of the batches such that
# the data is not repeated in the iteration and sampled accordingly
train_sampler = torch.utils.data.distributed.DistributedSampler(
train_dataset,
num_replicas=args.world_size,
rank=rank
)
# We pass in the train_sampler which can be used by the DataLoader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
batch_size=batch_size,
shuffle=False,
num_workers=0,
pin_memory=True,
sampler=train_sampler)
start = datetime.now()
total_step = len(train_loader)
for epoch in range(args.epochs):
for i, (images, labels) in enumerate(train_loader):
images = images.cuda(non_blocking=True)
labels = labels.cuda(non_blocking=True)
# Forward pass
outputs = model(images)
loss = criterion(outputs, labels)
# Backward and optimize
optimizer.zero_grad()
loss.backward()
optimizer.step()
if (i + 1) % 100 == 0 and gpu == 0:
print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(
epoch + 1,
args.epochs,
i + 1,
total_step,
loss.item())
)
if gpu == 0:
print("Training complete")
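The sample code derives world_size as gpus * nodes and each process's rank as nr * gpus + gpu. A minimal plain-Python sketch of that mapping follows; `global_ranks` is a hypothetical helper written for illustration and is not part of torch.distributed.

```python
def global_ranks(nodes, gpus_per_node):
    """Map (node_rank, local_gpu) -> global rank, as rank = nr * gpus + gpu."""
    return {
        (node_rank, gpu): node_rank * gpus_per_node + gpu
        for node_rank in range(nodes)
        for gpu in range(gpus_per_node)
    }


# One machine with 8 GPUs: nodes=1, gpus=8.
ranks = global_ranks(nodes=1, gpus_per_node=8)
world_size = 1 * 8
print(sorted(ranks.values()), world_size)
```

By these formulas, nodes=1 and gpus=8 is the combination that matches a single machine with eight boards. Since dist.init_process_group waits until all world_size processes have joined, a run that hangs with these values may point at the rendezvous settings (for example whether the MASTER_ADDR value resolves on that machine) rather than at the node and GPU counts. That is an assumption to verify.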

Issue with ArcFace (0 accuracy)

Hello, I've joined a university-level image recognition competition.
In the test, two face images are given and my model needs to decide whether the pair shows the same person or not.
My model is a ResNet-18 with IR blocks and SE blocks, trained with the ArcFace loss.
I can only use the MS1M dataset, which has a total of 86876 classes.
The problem is that the loss keeps improving, but the accuracy stays at 0 and does not change.
Here's the part of the code I'm working on.
Train
def train_model(model, net, criterion, optimizer, scheduler, num_epochs=25):
since = time.time()
best_model_wts = copy.deepcopy(model.state_dict())
best_acc = 0.0
for epoch in range(num_epochs):
print('Epoch {}/{}'.format(epoch, num_epochs - 1))
print('-' * 10)
for phase in ['train']:
if phase == 'train':
model.train() # Set model to training mode
running_loss = 0.0
running_corrects = 0
# Iterate over data.
for inputs, labels in notebook.tqdm(dataloader):
inputs = inputs.to(device)
labels = labels.to(device).long()
# zero the parameter gradients
optimizer.zero_grad()
# forward
# track history if only in train
with torch.set_grad_enabled(phase == 'train'):
features = model(inputs)
outputs = net(features, labels)
_, preds = torch.max(outputs, 1)
loss = criterion(outputs, labels)
# backward + optimize only if in training phase
if phase == 'train':
loss.backward()
optimizer.step()
# statistics
running_loss += loss.item() * inputs.size(0)
running_corrects += torch.sum(preds == labels.data)
if phase == 'train':
scheduler.step()
epoch_loss = running_loss / len(dataloader)
epoch_acc = running_corrects.double() / len(dataloader)
print('{} Loss: {:.4f} Acc: {:.4f}'.format(
phase, epoch_loss, epoch_acc))
# deep copy the model
if phase == 'train' and epoch_acc > best_acc:
best_acc = epoch_acc
best_model_wts = copy.deepcopy(model.state_dict())
torch.save({'epoch': epoch,
'mode_state_dict': model.state_dict(),
'fc_state_dict': net.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'scheduler': scheduler.state_dict(), # HERE IS THE CHANGE
}, f'/content/drive/MyDrive/inha_data/training_saver/training_stat{epoch}.pth')
print(f'finished {epoch} and saved model_save_{epoch}.pt')
print()
time_elapsed = time.time() - since
print('Training complete in {:.0f}m {:.0f}s'.format(
time_elapsed // 60, time_elapsed % 60))
print('Best train Acc: {:4f}'.format(best_acc))
# load best model weights
model.load_state_dict(best_model_wts)
torch.save(model.state_dict(), 'model_save.pt')
return model
Parameters
train_dataset = MS1MDataset('train')
dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=128, shuffle=True,num_workers=4)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") # device setup
num_classes = 86876
# normal classifier
# net = nn.Sequential(nn.Linear(512, num_classes))
# Feature extractor backbone, input is 112x112 image output is 512 feature vector
model_ft = resnet18(True)
#set metric
metric_fc = metrics.ArcMarginProduct(512, num_classes, s = 30.0, m = 0.50, easy_margin = False)
metric_fc.to(device)
# net = net.to(device)
model_ft = model_ft.to(device)
criterion = nn.CrossEntropyLoss()
# Observe that all parameters are being optimized
optimizer_ft = torch.optim.Adam([{'params': model_ft.parameters()}, {'params': metric_fc.parameters()}],
lr=0.1)
# Decay LR by a factor of 0.1 every 4 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=4, gamma=0.1)
Arcface
from __future__ import print_function
from __future__ import division
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import Parameter
import math
class ArcMarginProduct(nn.Module):
r"""Implement of large margin arc distance: :
Args:
in_features: size of each input sample
out_features: size of each output sample
s: norm of input feature
m: margin
cos(theta + m)
"""
def __init__(self, in_features, out_features, s=30.0, m=0.50, easy_margin=False):
super(ArcMarginProduct, self).__init__()
self.in_features = in_features
self.out_features = out_features
self.s = s
self.m = m
self.weight = Parameter(torch.FloatTensor(out_features, in_features))
nn.init.xavier_uniform_(self.weight)
self.easy_margin = easy_margin
self.cos_m = math.cos(m)
self.sin_m = math.sin(m)
self.th = math.cos(math.pi - m)
self.mm = math.sin(math.pi - m) * m
def forward(self, input, label):
# --------------------------- cos(theta) & phi(theta) ---------------------------
cosine = F.linear(F.normalize(input), F.normalize(self.weight))
sine = torch.sqrt((1.0 - torch.pow(cosine, 2)).clamp(0, 1))
phi = cosine * self.cos_m - sine * self.sin_m
if self.easy_margin:
phi = torch.where(cosine > 0, phi, cosine)
else:
phi = torch.where(cosine > self.th, phi, cosine - self.mm)
# --------------------------- convert label to one-hot ---------------------------
# one_hot = torch.zeros(cosine.size(), requires_grad=True, device='cuda')
one_hot = torch.zeros(cosine.size(), device='cuda')
one_hot.scatter_(1, label.view(-1, 1).long(), 1)
# -------------torch.where(out_i = {x_i if condition_i else y_i) -------------
output = (one_hot * phi) + ((1.0 - one_hot) * cosine) # you can use torch.where if your torch.__version__ is 0.4
output *= self.s
# print(output)
return output
dataset
data_transforms = {
'train': transforms.Compose([
transforms.RandomHorizontalFlip(),
transforms.ColorJitter(brightness=0.125, contrast=0.125, saturation=0.125),
transforms.ToTensor(),
transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
]),
}
#train_ms1_data = torchvision.datasets.ImageFolder('/content/drive/MyDrive/inha_data/train', transform = data_transforms)
class MS1MDataset(Dataset):
def __init__(self,split):
self.file_list = '/content/drive/MyDrive/inha_data/ID_List.txt'
self.images = []
self.labels = []
self.transformer = data_transforms['train']
with open(self.file_list) as f:
files = f.read().splitlines()
for i, fi in enumerate(files):
fi = fi.split()
image = "/content/" + fi[1]
label = int(fi[0])
self.images.append(image)
self.labels.append(label)
def __getitem__(self, index):
img = Image.open(self.images[index])
img = self.transformer(img)
label = self.labels[index]
return img, label
def __len__(self):
return len(self.images)

You can try a smaller margin m in ArcFace, or even a negative value.
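To see why the margin matters for the accuracy printed during training: the loop takes torch.max over the outputs of ArcMarginProduct, where the logit of the true class has already been pushed down by the margin. The plain-Python sketch below reproduces that per-class arithmetic for the easy_margin=False branch. The function name `target_logit` and the cosine value are hypothetical.

```python
import math


def target_logit(cosine, m=0.50, s=30.0):
    """Scaled logit ArcMarginProduct assigns to the true class (easy_margin=False)."""
    sine = math.sqrt(min(max(1.0 - cosine ** 2, 0.0), 1.0))
    phi = cosine * math.cos(m) - sine * math.sin(m)  # cos(theta + m)
    th = math.cos(math.pi - m)
    mm = math.sin(math.pi - m) * m
    if cosine <= th:
        phi = cosine - mm
    return s * phi


cosine = 0.5                # hypothetical similarity to the true class
plain = 30.0 * cosine       # logit a non-target class with the same cosine would get
with_margin = target_logit(cosine, m=0.50)
small_margin = target_logit(cosine, m=0.10)
print(plain, with_margin, small_margin)
```

With m = 0.50 the true class needs a much higher cosine than every one of the other classes before it wins the argmax, so accuracy measured on these margin-adjusted outputs can stay near zero while the loss still falls. A smaller m narrows that gap. This is a likely explanation rather than a confirmed one.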

Binary DenseNet 121 Classifier only predicting positive with probability >0.5

I borrowed code from this GitHub repo for training a DenseNet-121: https://github.com/gaetandi/cheXpert/blob/master/cheXpert_final.ipynb
The github code is for 14 class classification on the CheXpert chest X-ray dataset. I've revised it for binary classification.
# initialize and load the model
pathModel = "/ds2/images/model_ones_2epoch_densenet.tar"#"m-epoch0-07032019-213933.pth.tar"
I initialize the 14 class model so I can use the pretrained weights:
model = DenseNet121(nnClassCount).cuda()
model = torch.nn.DataParallel(model).cuda()
modelCheckpoint = torch.load(pathModel)
model.load_state_dict(modelCheckpoint['state_dict'])
And then convert to binary classification:
nnClassCount = 1
model.module.densenet121.classifier = nn.Sequential(
nn.Linear(1024, nnClassCount),
nn.Sigmoid()
).cuda()
model = torch.nn.DataParallel(model).cuda()
And then train via:
batch, losst, losse = CheXpertTrainer.train(model, dataLoaderTrain, dataLoaderVal, nnClassCount, 100, timestampLaunch, checkpoint = None, weight_path = weight_path)
My training data is laid out in a 2 column csv with column headers ('Path' and 'Class-Positive'), with path locations in the first column and 0 or 1 in the second column. I used oversampling when compiling the training list so paths in the csv are roughly a 50/50 split between 0's and 1's...shuffled.
I use livelossplot to monitor training/validation loss and accuracy. My loss plots look as expected, but the accuracy plots are flatlined around 0.5 (which makes sense given the 50/50 data if the net is saying it's 100% positive or negative). I assume I'm doing something wrong in how I make predictions, but maybe something in the training is incorrect.
For predictions and probabilities I'm running:
varOutput = model(varInput)
_, preds = torch.max(varOutput, 1)
print('varshape: ',varOutput.shape)
probs = torch.sigmoid(varOutput)
My issue: preds are all coming out as 0 and probs are all above 0.5.
Here is the initial code from github:
import os
import numpy as np
import time
import sys
import csv
import cv2
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn
import torchvision
import torchvision.transforms as transforms
import torch.optim as optim
import torch.nn.functional as tfunc
from torch.utils.data import Dataset
from torch.utils.data.dataset import random_split
from torch.utils.data import DataLoader
from torch.optim.lr_scheduler import ReduceLROnPlateau
from PIL import Image
import torch.nn.functional as func
from sklearn.metrics.ranking import roc_auc_score
import sklearn.metrics as metrics
import random
use_gpu = torch.cuda.is_available()
# Paths to the files with training, and validation sets.
# Each file contains pairs (path to image, output vector)
pathFileTrain = '../CheXpert-v1.0-small/train.csv'
pathFileValid = '../CheXpert-v1.0-small/valid.csv'
# Neural network parameters:
nnIsTrained = False #pre-trained using ImageNet
nnClassCount = 14 #dimension of the output
# Training settings: batch size, maximum number of epochs
trBatchSize = 64
trMaxEpoch = 3
# Parameters related to image transforms: size of the down-scaled image, cropped image
imgtransResize = (320, 320)
imgtransCrop = 224
# Class names
class_names = ['No Finding', 'Enlarged Cardiomediastinum', 'Cardiomegaly', 'Lung Opacity',
'Lung Lesion', 'Edema', 'Consolidation', 'Pneumonia', 'Atelectasis', 'Pneumothorax',
'Pleural Effusion', 'Pleural Other', 'Fracture', 'Support Devices']
class CheXpertDataSet(Dataset):
def __init__(self, image_list_file, transform=None, policy="ones"):
"""
image_list_file: path to the file containing images with corresponding labels.
transform: optional transform to be applied on a sample.
Upolicy: name the policy with regard to the uncertain labels
"""
image_names = []
labels = []
with open(image_list_file, "r") as f:
csvReader = csv.reader(f)
next(csvReader, None)
k=0
for line in csvReader:
k+=1
image_name= line[0]
label = line[5:]
for i in range(14):
if label[i]:
a = float(label[i])
if a == 1:
label[i] = 1
elif a == -1:
if policy == "ones":
label[i] = 1
elif policy == "zeroes":
label[i] = 0
else:
label[i] = 0
else:
label[i] = 0
else:
label[i] = 0
image_names.append('../' + image_name)
labels.append(label)
self.image_names = image_names
self.labels = labels
self.transform = transform
def __getitem__(self, index):
"""Take the index of item and returns the image and its labels"""
image_name = self.image_names[index]
image = Image.open(image_name).convert('RGB')
label = self.labels[index]
if self.transform is not None:
image = self.transform(image)
return image, torch.FloatTensor(label)
def __len__(self):
return len(self.image_names)
#TRANSFORM DATA
normalize = transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
transformList = []
#transformList.append(transforms.Resize(imgtransCrop))
transformList.append(transforms.RandomResizedCrop(imgtransCrop))
transformList.append(transforms.RandomHorizontalFlip())
transformList.append(transforms.ToTensor())
transformList.append(normalize)
transformSequence=transforms.Compose(transformList)
#LOAD DATASET
dataset = CheXpertDataSet(pathFileTrain ,transformSequence, policy="ones")
datasetTest, datasetTrain = random_split(dataset, [500, len(dataset) - 500])
datasetValid = CheXpertDataSet(pathFileValid, transformSequence)
# Problems with patient overlap and the identical transform?
dataLoaderTrain = DataLoader(dataset=datasetTrain, batch_size=trBatchSize, shuffle=True, num_workers=24, pin_memory=True)
dataLoaderVal = DataLoader(dataset=datasetValid, batch_size=trBatchSize, shuffle=False, num_workers=24, pin_memory=True)
dataLoaderTest = DataLoader(dataset=datasetTest, num_workers=24, pin_memory=True)
class CheXpertTrainer():
def train (model, dataLoaderTrain, dataLoaderVal, nnClassCount, trMaxEpoch, launchTimestamp, checkpoint):
#SETTINGS: OPTIMIZER & SCHEDULER
optimizer = optim.Adam (model.parameters(), lr=0.0001, betas=(0.9, 0.999), eps=1e-08, weight_decay=1e-5)
#SETTINGS: LOSS
loss = torch.nn.BCELoss(size_average = True)
#LOAD CHECKPOINT
if checkpoint != None and use_gpu:
modelCheckpoint = torch.load(checkpoint)
model.load_state_dict(modelCheckpoint['state_dict'])
optimizer.load_state_dict(modelCheckpoint['optimizer'])
#TRAIN THE NETWORK
lossMIN = 100000
for epochID in range(0, trMaxEpoch):
timestampTime = time.strftime("%H%M%S")
timestampDate = time.strftime("%d%m%Y")
timestampSTART = timestampDate + '-' + timestampTime
batchs, losst, losse = CheXpertTrainer.epochTrain(model, dataLoaderTrain, optimizer, trMaxEpoch, nnClassCount, loss)
lossVal = CheXpertTrainer.epochVal(model, dataLoaderVal, optimizer, trMaxEpoch, nnClassCount, loss)
timestampTime = time.strftime("%H%M%S")
timestampDate = time.strftime("%d%m%Y")
timestampEND = timestampDate + '-' + timestampTime
if lossVal < lossMIN:
lossMIN = lossVal
torch.save({'epoch': epochID + 1, 'state_dict': model.state_dict(), 'best_loss': lossMIN, 'optimizer' : optimizer.state_dict()}, 'm-epoch'+str(epochID)+'-' + launchTimestamp + '.pth.tar')
print ('Epoch [' + str(epochID + 1) + '] [save] [' + timestampEND + '] loss= ' + str(lossVal))
else:
print ('Epoch [' + str(epochID + 1) + '] [----] [' + timestampEND + '] loss= ' + str(lossVal))
return batchs, losst, losse
#--------------------------------------------------------------------------------
def epochTrain(model, dataLoader, optimizer, epochMax, classCount, loss):
batch = []
losstrain = []
losseval = []
model.train()
for batchID, (varInput, target) in enumerate(dataLoaderTrain):
varTarget = target.cuda(non_blocking = True)
#varTarget = target.cuda()
varOutput = model(varInput)
lossvalue = loss(varOutput, varTarget)
optimizer.zero_grad()
lossvalue.backward()
optimizer.step()
l = lossvalue.item()
losstrain.append(l)
if batchID%35==0:
print(batchID//35, "% batches computed")
#Fill three arrays to see the evolution of the loss
batch.append(batchID)
le = CheXpertTrainer.epochVal(model, dataLoaderVal, optimizer, trMaxEpoch, nnClassCount, loss).item()
losseval.append(le)
print(batchID)
print(l)
print(le)
return batch, losstrain, losseval
#--------------------------------------------------------------------------------
def epochVal(model, dataLoader, optimizer, epochMax, classCount, loss):
model.eval()
lossVal = 0
lossValNorm = 0
with torch.no_grad():
for i, (varInput, target) in enumerate(dataLoaderVal):
target = target.cuda(non_blocking = True)
varOutput = model(varInput)
losstensor = loss(varOutput, target)
lossVal += losstensor
lossValNorm += 1
outLoss = lossVal / lossValNorm
return outLoss
#--------------------------------------------------------------------------------
#---- Computes area under ROC curve
#---- dataGT - ground truth data
#---- dataPRED - predicted data
#---- classCount - number of classes
def computeAUROC (dataGT, dataPRED, classCount):
outAUROC = []
datanpGT = dataGT.cpu().numpy()
datanpPRED = dataPRED.cpu().numpy()
for i in range(classCount):
try:
outAUROC.append(roc_auc_score(datanpGT[:, i], datanpPRED[:, i]))
except ValueError:
pass
return outAUROC
#--------------------------------------------------------------------------------
def test(model, dataLoaderTest, nnClassCount, checkpoint, class_names):
cudnn.benchmark = True
if checkpoint != None and use_gpu:
modelCheckpoint = torch.load(checkpoint)
model.load_state_dict(modelCheckpoint['state_dict'])
if use_gpu:
outGT = torch.FloatTensor().cuda()
outPRED = torch.FloatTensor().cuda()
else:
outGT = torch.FloatTensor()
outPRED = torch.FloatTensor()
model.eval()
with torch.no_grad():
for i, (input, target) in enumerate(dataLoaderTest):
target = target.cuda()
outGT = torch.cat((outGT, target), 0).cuda()
bs, c, h, w = input.size()
varInput = input.view(-1, c, h, w)
out = model(varInput)
outPRED = torch.cat((outPRED, out), 0)
aurocIndividual = CheXpertTrainer.computeAUROC(outGT, outPRED, nnClassCount)
aurocMean = np.array(aurocIndividual).mean()
print ('AUROC mean ', aurocMean)
for i in range (0, len(aurocIndividual)):
print (class_names[i], ' ', aurocIndividual[i])
return outGT, outPRED
class DenseNet121(nn.Module):
"""Model modified.
The architecture of our model is the same as standard DenseNet121
except the classifier layer which has an additional sigmoid function.
"""
def __init__(self, out_size):
super(DenseNet121, self).__init__()
self.densenet121 = torchvision.models.densenet121(pretrained=True)
num_ftrs = self.densenet121.classifier.in_features
self.densenet121.classifier = nn.Sequential(
nn.Linear(num_ftrs, out_size),
nn.Sigmoid()
)
def forward(self, x):
x = self.densenet121(x)
return x
# initialize and load the model
model = DenseNet121(nnClassCount).cuda()
model = torch.nn.DataParallel(model).cuda()
timestampTime = time.strftime("%H%M%S")
timestampDate = time.strftime("%d%m%Y")
timestampLaunch = timestampDate + '-' + timestampTime
batch, losst, losse = CheXpertTrainer.train(model, dataLoaderTrain, dataLoaderVal, nnClassCount, trMaxEpoch, timestampLaunch, checkpoint = None)
print("Model trained")

It looks like you have adapted the training correctly for binary classification, but not the prediction: you are still treating it as a multi-class prediction.
The output of your model (varOutput) has the size (batch_size, 1), since there is only one class. The index of the maximum across that dimension will always be 0, since that is the only class available; there is no separate class for 1.
This single output represents both cases (0 and 1), so you can treat it as the probability of the sample being positive (1). To get a distinct value of either 0 or 1, simply use a threshold of 0.5, so that everything below it receives class 0 and everything above it class 1. This can easily be done with torch.round.
You also have another problem: you're applying the sigmoid function twice in a row, once in the classifier (nn.Sigmoid()) and then again with torch.sigmoid(varOutput). That is problematic because the first sigmoid already produces values between 0 and 1, and the sigmoid of any non-negative input is at least sigmoid(0) = 0.5; hence all your probabilities are over 0.5.
The outputs of your model are already the probabilities; the only thing left is to round them:
probs = model(varInput)
# The .squeeze(1) is to get rid of the singular class dimension
preds = torch.round(probs).squeeze(1)
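Both effects can be checked with a few numbers. The sketch below uses plain Python instead of PyTorch so it runs anywhere; the logits are hypothetical values chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


logits = [-6.0, -1.0, 1.0, 6.0]           # hypothetical pre-sigmoid outputs
probs = [sigmoid(z) for z in logits]      # what the model already returns
double = [sigmoid(p) for p in probs]      # sigmoid applied a second time

# Thresholding the single probability at 0.5 gives the class.
preds = [1 if p >= 0.5 else 0 for p in probs]

# Taking the arg-max over a single-column row can only ever return index 0.
argmax_per_row = [max(range(len(row)), key=row.__getitem__) for row in ([p] for p in probs)]

print(probs)
print(double)
print(preds, argmax_per_row)
```

The doubly squashed values all fall strictly between 0.5 and sigmoid(1), which matches the symptom of every probability being above 0.5, and the arg-max column index is 0 for every row no matter what the probability is.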
