How can I extract labels from the results in YOLOv5?

Is there any way to extract the detected labels, such as person, cat, or dog, that are printed by the results.print() function? I want these detected labels to be saved in an array and used later. I am using the YOLOv5 model here.
import cv2
import numpy as np

# model is the YOLOv5 model loaded earlier, e.g. via torch.hub.load('ultralytics/yolov5', 'yolov5s')
cap = cv2.VideoCapture(0)
while cap.isOpened():
    ret, frame = cap.read()
    # Make detections
    results = model(frame)
    results.print()
    # Showing the box and prediction
    cv2.imshow('YOLO', np.squeeze(results.render()))
    if cv2.waitKey(10) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
The printed output of results.print() looked like this:
image 1/1: 480x640 1 person
Speed: 7.0ms pre-process, 80.6ms inference, 3.5ms NMS per image at shape (1, 3, 480, 640)
From this output, I want to extract the person label and store it in an array.

This might not be the optimal solution, but here's an approach that I used for a personal project:
lst = []
cap = cv2.VideoCapture(0)
while cap.isOpened():
    ret, frame = cap.read()
    # Make detections
    results = model(frame)
    cv2.imshow('YOLO', np.squeeze(results.render()))
    df = results.pandas().xyxy[0]
    for i in df['name']:  # the 'name' column holds the labels
        lst.append(i)
    if cv2.waitKey(10) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
I used results.pandas().xyxy[0] to get the results as a DataFrame and then appended the labels to a list.
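If you need more than the labels, the same DataFrame also carries the box coordinates and confidences. A minimal sketch (the column names are those returned by YOLOv5's pandas() helper):

df = results.pandas().xyxy[0]
# columns: xmin, ymin, xmax, ymax, confidence, class, name
labels = df['name'].tolist()             # e.g. ['person']
confidences = df['confidence'].tolist()
boxes = df[['xmin', 'ymin', 'xmax', 'ymax']].values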

Assuming you use YOLOv5 with PyTorch, please see this link. It details how to interpret the results as JSON objects and also explains their structure.
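As a rough sketch of that idea (standard pandas API, not code from the link), the detections DataFrame can be serialized to JSON directly:

import json

df = results.pandas().xyxy[0]
records = json.loads(df.to_json(orient='records'))
# records is a list of dicts, one per detection, e.g.
# [{'xmin': ..., 'ymin': ..., 'xmax': ..., 'ymax': ..., 'confidence': ..., 'class': 0, 'name': 'person'}]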

Related

Training LSTM but output converges to a single value after a few inputs

Background
I'm learning about LSTMs and figured I'd try training a very basic LSTM model (I'm a big believer in learning by doing).
To start with something basic, I tried to implement an LSTM that would sum up the last 10 inputs it has seen. I generated a dataset consisting of 1000 random numbers between 0 and 1, with 1000 labels representing the sum of the previous 10 numbers (label[i] = data[i-9:i+1].sum()), and tried to train the LSTM to recognize this pattern.
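For concreteness, a minimal sketch of that dataset generation (my own code, with numpy assumed; the stated formula only applies cleanly from index 9 onward):

import numpy as np

data = np.random.rand(1000)  # 1000 random numbers in [0, 1)
label = np.array([data[max(0, i - 9):i + 1].sum() for i in range(len(data))])
# label[i] = data[i-9:i+1].sum() for i >= 9; earlier positions sum what is available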
I know that simpler models can solve this (i.e. linear regression), but I believe that LSTMs should also be able to solve this fairly basic problem.
My initial implementation does seem to work, but when I try to improve it, I start getting constant output values after a few timesteps, approximately the average of the training labels.
I'd appreciate any insights as to why the second and third iterations don't work, especially the third iteration, since it looks like it's the same implementation as what I read in "Deep Learning for Coders with fastai & PyTorch" book.
What I've tried so far
I've done 3 iterations so far:
Initially, I generated all sub-sequences of length 10 from the input data along with the corresponding label ([(data[i-9:i+1], label[i]) for i in range(9, len(data))]) and fed this into the LSTM.
This iteration worked very well: if I feed a sequence of 10 inputs, I get an output from the LSTM that is very close to the sum. However, it is kind of cheating in that I'm basically telling the LSTM that the sequence is of length 10. I believe an LSTM should be able to infer the sequence length, so I tried to remove that bit of information.
In my second iteration, I feed the entire sequence into the LSTM at once: ([data[i] for i in range(len(data))], [label[i] for i in range(len(data))]). Basically, a single input with a sequence length of 1000, with a single output of 1000 labels.
However, after training, while running on validation data of length 100, all except the first few labels are always a constant number, approximately the average of the training labels.
In my last iteration, I tried feeding the inputs one at a time to the LSTM (1000 inputs with a sequence length of 1), manually storing the hidden and cell states and passing them into the next run with the next input. This produces results similar to #2.
Network setup
For all runs, I used a single-layer LSTM with a hidden size of 25, since it's a fairly simple problem. I did try adding layers with dropout or increasing the hidden size, but it didn't help.
Code samples
First iteration:
# Module is presumably fastai's Module, which handles super().__init__() automatically
class LSTMModel(Module):
    def __init__(self, seq_len, layers, input_size, hidden_size):
        self.lstm1 = nn.LSTM(input_size=input_size, hidden_size=hidden_size,
                             num_layers=layers, bidirectional=False, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        x, _ = self.lstm1(x)
        # [:, -1, :] grabs the last output for the sequence length of 10
        x = self.fc(x[:, -1, :])
        return x[:, 0]
Second iteration:
class LSTMModel(Module):
    def __init__(self, layers, input_size, hidden_size):
        self.lstm1 = nn.LSTM(input_size=input_size, hidden_size=hidden_size,
                             num_layers=layers, bidirectional=False)
        self.fc = nn.Linear(hidden_size, 1)
        self.input_size = input_size

    def forward(self, x):
        x, h = self.lstm1(x)
        x = self.fc(x)
        return x
Final iteration:
class LSTMModel(Module):
    def __init__(self, layers, input_size, hidden_size):
        self.lstm1 = nn.LSTM(input_size=input_size, hidden_size=hidden_size,
                             num_layers=layers, bidirectional=False, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)
        # hidden and cell states, carried across calls
        self.h = [torch.zeros(layers, 1, hidden_size).cuda() for _ in range(2)]

    def forward(self, x):
        x, h = self.lstm1(x, self.h)
        self.h = [h_.detach() for h_ in h]
        x = self.fc(x)
        return x

    def reset(self):
        for h in self.h:
            h.zero_()
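For reference, a minimal usage sketch of this final iteration as described above (one input at a time, sequence length 1; the variable names are mine, not from the original post):

model = LSTMModel(layers=1, input_size=1, hidden_size=25).cuda()
model.reset()
seq = torch.rand(1000).cuda()
outputs = []
for t in range(seq.size(0)):
    x = seq[t].view(1, 1, 1)  # (batch=1, seq_len=1, features=1)
    outputs.append(model(x))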

When I use the image encoder of CLIP and the text encoder of RoBERTa, the top-k of text-to-image retrieval increases to 15 and then immediately drops to 0.1

I am working on the task of pedestrian retrieval with text.
I use the image encoder of the CLIP model to encode the image and RoBERTa to encode the language, and finally compute the cosine similarity of the two 768-dimensional vectors.
I only use the image encoder of CLIP, with the projector removed, as the image encoder.
I replaced the text encoder and image encoder in the original framework, and the result of the original framework is reasonable.
With the new encoders, the top-k of text-to-image increases to 15 and then immediately drops to 0.1 by epoch 9, while with the original framework the top-k can reach 58.
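For reference, a minimal sketch of the similarity computation described above (my own code, not the poster's; it assumes two batches of 768-dimensional features):

import torch
import torch.nn.functional as F

image_feat = torch.randn(8, 768)  # stand-in for CLIP image features
text_feat = torch.randn(8, 768)   # stand-in for RoBERTa text features

# L2-normalize, then a matrix product gives pairwise cosine similarities
image_feat = F.normalize(image_feat, dim=-1)
text_feat = F.normalize(text_feat, dim=-1)
sim = text_feat @ image_feat.t()  # [8, 8] text-to-image similarity matrix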
class RobertaTextEncode(nn.Module):
    def __init__(self, args):
        super(RobertaTextEncode, self).__init__()
        self.out_channels = 768
        self.args = args
        self.in_planes = 768
        self.tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
        self.text_encode = RobertaModel.from_pretrained('roberta-base')

    def forward(self, captions):
        caption = [caption.text for caption in captions]
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        tokenized = self.tokenizer.batch_encode_plus(
            caption, truncation='longest_first', padding='max_length',
            max_length=self.args.MODEL.text_length, add_special_tokens=True,
            return_tensors='pt').to(device)
        encode_text = self.text_encode(**tokenized)
        text_feature = encode_text.pooler_output  # [b, 768]
        return text_feature
def load_pretrain_model(model_path):
    from .clip import clip
    url = clip._MODELS[model_path]
    model_path = clip._download(url)
    try:
        model = torch.jit.load(model_path, map_location="cpu").eval()
        state_dict = None
    except RuntimeError:
        state_dict = torch.load(model_path, map_location="cpu")
    h_resolution = int((224 - 32) // 32 + 1)
    w_resolution = int((224 - 32) // 32 + 1)
    model = clip.build_model(state_dict or model.state_dict(), h_resolution, w_resolution, 32)
    return model
class clipImageEncode(nn.Module):
    def __init__(self, cfg):
        super(clipImageEncode, self).__init__()
        clip_model = load_pretrain_model('ViT-B/32')
        clip_model.to('cuda')
        self.image_encode = clip_model.encode_image

    def forward(self, x):
        visual_feat = self.image_encode(x)
        return visual_feat
I want to know why this happens. I would appreciate any suggestions you could provide.

How can I concatenate the 4 corners of an image quickly when loading images in deep learning?

What is the most effective way to concatenate the 4 corners of an image, as shown in this photo?
(This is done in __getitem__().)
left_img = Image.open('image.jpg')
...
output = right_img
This is how I would do it. First, I would temporarily convert the image to a tensor image:
from torchvision import transforms
tensor_image = transforms.ToTensor()(image)
Now assume you have a 3-channel image (although similar principles apply to matrices with any number of channels, including 1-channel grayscale images).
You can find the red channel with tensor_image[0], the green channel with tensor_image[1], and the blue channel with tensor_image[2].
You can make a for loop iterating through each channel, like:
for i in range(tensor_image.size(0)):
    curr_channel = tensor_image[i]
Now, inside that for loop, for each channel you can extract the:
Top-left corner pixel with float(curr_channel[0][0])
Top-right corner pixel with float(curr_channel[0][-1])
Bottom-left corner pixel with float(curr_channel[-1][0])
Bottom-right corner pixel with float(curr_channel[-1][-1])
Make sure to convert all the pixel values to float or double values before the next appending step. You now have four values that correspond to the corner pixels of each channel.
Then you can make a list called new_image = [] and append the above-mentioned pixel values to it using:
new_image.append([[float(curr_channel[0][0]), float(curr_channel[0][-1])],
                  [float(curr_channel[-1][0]), float(curr_channel[-1][-1])]])
Now, after iterating through every channel, you should have a big list that contains three (or, in general, tensor_image.size(0)) lists of lists.
The next step is to convert this list of lists of lists to a torch.Tensor by running:
new_image = torch.tensor(new_image)
To make sure everything is right, new_image.size() should return torch.Size([3, 2, 2]).
If that is the case, you now have your wanted image, but in tensor format.
The way to convert it back to PIL is to run
final_pil_image = transforms.ToPILImage()(new_image)
If everything went well, you should have a PIL image that fulfills your task. The only code it uses is clever indexing and one for loop.
There is a possibility, however, if you look into it more than I have, that you can avoid the for loop and perform the operations on all the channels at once.
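As a rough sketch of that idea (my own addition, not part of the original answer), advanced indexing can grab all four corners of every channel at once:

import torch

# corner positions: (0, 0), (0, -1), (-1, 0), (-1, -1)
rows = torch.tensor([0, 0, -1, -1])
cols = torch.tensor([0, -1, 0, -1])
new_image = tensor_image[:, rows, cols].reshape(-1, 2, 2)  # torch.Size([3, 2, 2])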
Sarthak Jain
I don't know how quick this is but here:
import numpy as np
from PIL import Image

img = np.array(Image.open('image.jpg'))
h, w = img.shape[0], img.shape[1]
# the window size:
r = 4
upper_left = img[:r, :r]
lower_left = img[h-r:, :r]
upper_right = img[:r, w-r:]
lower_right = img[h-r:, w-r:]
upper_half = np.concatenate((upper_left, upper_right), axis=1)
lower_half = np.concatenate((lower_left, lower_right), axis=1)
img = np.concatenate((upper_half, lower_half))
or, more compactly:
upper_half = np.concatenate((img[:r, :r], img[:r, w-r:]), axis=1)
lower_half = np.concatenate((img[h-r:, :r], img[h-r:, w-r:]), axis=1)
img = np.concatenate((upper_half, lower_half))

Training loss is NaN for image segmentation on TPU using TFRecords

I am a beginner trying to work with TPUs using TensorFlow in Kaggle Kernels. I previously trained a U-Net model on a dataset using a GPU, and now I am trying to implement that on a TPU. I made a TFRecord out of the dataset's images and masks, and the TFRecord returns an image and a mask. When I try to train on the TPU, the loss is always NaN, even though the accuracy metric is normal. Since this is the same model and loss I used on the GPU, I am guessing the problem is in the TFRecord or in loading the dataset.
The code for loading the data is given below:
def decode_image(image_data):
    image = tf.image.decode_jpeg(image_data, channels=3)
    image = tf.cast(image, tf.float32) / 255.0  # convert image to floats in [0, 1] range
    image = tf.reshape(image, [*IMAGE_SIZE, 3])  # explicit size needed for TPU
    return image

def decode_image_mask(image_data):
    image = tf.image.decode_jpeg(image_data, channels=3)
    image = tf.cast(image, tf.float64) / 255.0  # convert image to floats in [0, 1] range
    image = tf.reshape(image, [*IMAGE_SIZE, 3])  # explicit size needed for TPU
    image = tf.image.rgb_to_grayscale(image)
    image = tf.math.round(image)
    return image

def read_tfrecord(example):
    TFREC_FORMAT = {
        "image": tf.io.FixedLenFeature([], tf.string),  # tf.string means bytestring
        "mask": tf.io.FixedLenFeature([], tf.string),   # shape [] means single element
    }
    example = tf.io.parse_single_example(example, TFREC_FORMAT)
    image = decode_image(example['image'])
    mask = decode_image_mask(example['mask'])
    return image, mask

def load_dataset(filenames, ordered=False):
    # Read from TFRecords. For optimal performance, read from multiple files at once and
    # disregard data order. Order does not matter since we will be shuffling the data anyway.
    ignore_order = tf.data.Options()
    if not ordered:
        ignore_order.experimental_deterministic = False  # disable order, increase speed
    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTO)  # automatically interleaves reads from multiple files
    dataset = dataset.with_options(ignore_order)  # uses data as soon as it streams in, rather than in its original order
    dataset = dataset.map(read_tfrecord, num_parallel_calls=AUTO)
    return dataset

def get_training_dataset():
    dataset = load_dataset(TRAINING_FILENAMES)
    dataset = dataset.repeat()  # the training dataset must repeat for several epochs
    dataset = dataset.shuffle(2048)
    dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
    dataset = dataset.prefetch(AUTO)  # prefetch next batch while training (autotune prefetch buffer size)
    return dataset

def get_validation_dataset(ordered=False):
    dataset = load_dataset(VALIDATION_FILENAMES, ordered=ordered)
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.cache()
    dataset = dataset.prefetch(AUTO)  # prefetch next batch while training (autotune prefetch buffer size)
    return dataset

def count_data_items(filenames):
    # the number of data items is written in the name of the .tfrec files, i.e. flowers00-230.tfrec = 230 data items
    n = [int(re.compile(r"-([0-9]*)\.").search(filename).group(1)) for filename in filenames]
    return np.sum(n)
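For instance, with the naming convention from that comment, the helper behaves like this (hypothetical filename):

count_data_items(["flowers00-230.tfrec"])  # -> 230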
So, what am I doing wrong?
It turns out the problem was that I was unbatching the data and re-batching it to 20 in order to properly view the images and masks in matplotlib, and this was messing up how the data was being sent to the model, hence the NaN loss. Making another copy of the dataset and using that to view the images, while sending the original to training, solved the problem.
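A minimal sketch of that fix, assuming the plotting code looked roughly like the description above (the names and the batch size of 20 are mine):

# separate copy of the dataset, used only for viewing images and masks in matplotlib
viz_dataset = get_training_dataset().unbatch().batch(20)
images, masks = next(iter(viz_dataset))

# the untouched dataset is the one handed to model.fit() for training
train_dataset = get_training_dataset()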

Tensor shape mismatch error in PyTorch on MNIST dataset, but no error on synthetic data

I am trying to implement a deep learning paper (https://github.com/kiankd/corel2019) and am getting a weird error when supplying real data (MNIST) to it, but no error when using the same synthetic data as the authors used.
The error happens in this function:
def get_armask(shape, labels, device=None):
    mask = torch.zeros(shape).to(device)
    arr = torch.arange(0, shape[0]).long().to(device)
    mask[arr, labels] = -1.
    return mask
More specifically this line:
mask[arr, labels] = -1.
The error is:
RuntimeError: The shape of the mask [500] at index 0 does not match the shape of the indexed tensor [500, 10] at index 1
The weird thing is that if I use the synthetic data, there is no error and it works perfectly. If I print out the shapes, I get the following (both with synthetic data and with MNIST):
mask torch.Size([500, 10])
arr torch.Size([500])
labels torch.Size([500])
The code used to generate the synthetic data is the following:
X_data = (torch.rand(N_samples, D_input) * 10.).to(device)
labels = torch.LongTensor([i % N_classes for i in range(N_samples)]).to(device)
While the code to load MNIST is this:
train_images = mnist.train_images()
X_data_all = train_images.reshape((train_images.shape[0], train_images.shape[1] * train_images.shape[2]))
X_data = torch.tensor(X_data_all[:500,:]).to(device)
X_data = X_data.type(torch.FloatTensor)
labels = torch.tensor(mnist.train_labels()[:500]).to(device)
get_armask is used the following way:
def forward(self, predictions, labels):
    mask = get_armask(predictions.shape, labels, device=self.device)
    # make the attractor and repulsor, mask them!
    attraction_tensor = mask * predictions
    repulsion_tensor = (mask + 1) * predictions
    # now, apply the special cosine-COREL rules, taking the argmax and squaring the repulsion
    repulsion_tensor, _ = repulsion_tensor.max(dim=1)
    repulsion_tensor = repulsion_tensor ** 2
    return arloss(attraction_tensor, repulsion_tensor, self.lam)
The actual error seems to be different from what is in the error message, but I have no idea where to look. I tried a few things, like changing the learning rate and normalizing the MNIST data to be more or less in the same range as the synthetic data, but nothing seems to work.
Any suggestions? Thanks a lot in advance!
After exchanging some emails with the author of the paper, we figured out the problem. The labels were of type Byte instead of Long, and that caused the error. The error message is very misleading; the actual problem has nothing to do with the sizes...
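In code, the fix amounts to casting the labels before they reach get_armask; a minimal sketch based on the loading code above:

labels = torch.tensor(mnist.train_labels()[:500]).long().to(device)  # Byte -> Long
# equivalently, on an existing tensor: labels = labels.long()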