When I use the image encoder of CLIP and the text encoder of RoBERTa, the top-k of text-to-image retrieval rises to 15 and then immediately drops to 0.1

I am working on text-based pedestrian retrieval.
I use the image encoder of the CLIP model to encode the image and RoBERTa to encode the text, and finally compute the cosine similarity between the two 768-dimensional vectors.
As the image encoder I use only CLIP's image encoder, with the projection layer removed.
I swapped the text encoder and image encoder into the original framework, and the results of the original framework are reasonable.
With the new encoders, the top-k of text-to-image retrieval rises to 15 and then immediately drops to 0.1 by epoch 9, whereas with the original framework the top-k reaches 58.
class RobertaTextEncode(nn.Module):
    def __init__(self, args):
        super(RobertaTextEncode, self).__init__()
        self.out_channels = 768
        self.args = args
        self.in_planes = 768
        self.tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
        self.text_encode = RobertaModel.from_pretrained('roberta-base')

    def forward(self, captions):
        caption = [caption.text for caption in captions]
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        tokenized = self.tokenizer.batch_encode_plus(
            caption,
            truncation='longest_first',
            padding='max_length',
            max_length=self.args.MODEL.text_length,
            add_special_tokens=True,
            return_tensors='pt',
        ).to(device)
        encode_text = self.text_encode(**tokenized)
        text_feature = encode_text.pooler_output  # [b, 768]
        return text_feature
def load_pretrain_model(model_path):
    from .clip import clip
    url = clip._MODELS[model_path]
    model_path = clip._download(url)
    try:
        # loading a JIT archive
        model = torch.jit.load(model_path, map_location="cpu").eval()
        state_dict = None
    except RuntimeError:
        state_dict = torch.load(model_path, map_location="cpu")
    h_resolution = int((224 - 32) // 32 + 1)
    w_resolution = int((224 - 32) // 32 + 1)
    model = clip.build_model(state_dict or model.state_dict(), h_resolution, w_resolution, 32)
    return model
class clipImageEncode(nn.Module):
    def __init__(self, cfg):
        super(clipImageEncode, self).__init__()
        clip_model = load_pretrain_model('ViT-B/32')
        clip_model.to('cuda')
        self.image_encode = clip_model.encode_image

    def forward(self, x):
        visual_feat = self.image_encode(x)
        return visual_feat
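For reference, this is roughly how the two 768-dimensional features are compared at retrieval time; the tensor names and sizes below are placeholders for illustration, not my exact training code:
import torch
import torch.nn.functional as F

# Hypothetical batch of features from the two encoders above:
# image_feat: [num_images, 768], text_feat: [num_texts, 768]
image_feat = torch.randn(128, 768)
text_feat = torch.randn(64, 768)

# L2-normalize so the dot product equals cosine similarity
image_feat = F.normalize(image_feat, dim=-1)
text_feat = F.normalize(text_feat, dim=-1)

# Similarity matrix: one row per text query, one column per gallery image
sim = text_feat @ image_feat.t()           # [num_texts, num_images]

# Indices of the top-10 images for each text query
topk_idx = sim.topk(k=10, dim=-1).indices  # [num_texts, 10]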
I want to know why this happens. I would appreciate any suggestions.

Related

How can I extract labels from results in YOLOv5?

Is there any way to extract the detected labels (such as person, cat, or dog) that are printed by the results.print() function? I want these detected labels saved in an array so I can use them later. I am using a YOLOv5 model here.
cap = cv2.VideoCapture(0)
while cap.isOpened():
    ret, frame = cap.read()
    # Make detections
    results = model(frame)
    results.print()
    # Showing the box and prediction
    cv2.imshow('YOLO', np.squeeze(results.render()))
    if cv2.waitKey(10) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
The printed output of results.print() looked like this:
image 1/1: 480x640 1 person
Speed: 7.0ms pre-process, 80.6ms inference, 3.5ms NMS per image at shape (1, 3, 480, 640)
From this output, I want to extract the person label and store it in an array.
This might not be the optimal solution, but here's an approach that I used for a personal project:
lst = []
cap = cv2.VideoCapture(0)
while cap.isOpened():
    ret, frame = cap.read()
    # Make detections
    results = model(frame)
    cv2.imshow('YOLO', np.squeeze(results.render()))
    df = results.pandas().xyxy[0]
    for i in df['name']:  # name -> labels
        lst.append(i)
    if cv2.waitKey(10) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
I used results.pandas().xyxy[0] to get the results as a data frame and then appended the labels to a list.
Assuming you use YOLOv5 with PyTorch, please see this link. It details how to interpret the results as JSON objects and also explains their structure.
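As a rough sketch of that idea (assuming results comes from a torch.hub-loaded YOLOv5 model, as in the question), the same data frame can be dumped to JSON records and the labels read from there:
import json

df = results.pandas().xyxy[0]                  # one row per detection
records = json.loads(df.to_json(orient="records"))

labels = [r["name"] for r in records]          # e.g. ['person', 'dog']
print(labels)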

Python 3.8, Pygame 2 on Windows 10: can't save PNG file [duplicate]

I am making an image cropper using pygame as the interface and OpenCV for image processing.
I have created functions like crop(), colorfilter(), etc. I load the image with pygame.image.load() to show it on screen, but after crop() the result is a numpy.ndarray, which pygame cannot display, giving the error:
argument 1 must be pygame.Surface, not numpy.ndarray
How do I solve this problem? I need to blit() the cropped image. Should I save the image, read it back, and delete it afterwards, given that I want to apply more than one filter?
The following function converts an OpenCV (cv2) image, i.e. a numpy.array (they are the same thing), to a pygame.Surface:
import numpy as np
import pygame

def cv2ImageToSurface(cv2Image):
    if cv2Image.dtype.name == 'uint16':
        cv2Image = (cv2Image / 256).astype('uint8')
    size = cv2Image.shape[1::-1]
    if len(cv2Image.shape) == 2:
        cv2Image = np.repeat(cv2Image.reshape(size[1], size[0], 1), 3, axis = 2)
        format = 'RGB'
    else:
        format = 'RGBA' if cv2Image.shape[2] == 4 else 'RGB'
        cv2Image[:, :, [0, 2]] = cv2Image[:, :, [2, 0]]
    surface = pygame.image.frombuffer(cv2Image.flatten(), size, format)
    return surface.convert_alpha() if format == 'RGBA' else surface.convert()
See "How do I convert an OpenCV (cv2) image (BGR and BGRA) to a pygame.Surface object" for a detailed explanation of the function.
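As a usage sketch (the file name and window size here are placeholders, and it assumes the cv2ImageToSurface function above):
import cv2
import pygame

pygame.init()
screen = pygame.display.set_mode((640, 480))

# Load with cv2 (BGR / BGRA), convert, then blit like any other Surface
cv2_img = cv2.imread('cropped.png', cv2.IMREAD_UNCHANGED)
surface = cv2ImageToSurface(cv2_img)

screen.blit(surface, (0, 0))
pygame.display.flip()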

Training loss is NaN for image segmentation on TPU using TFRecords

I am a beginner trying to work with TPUs using TensorFlow in Kaggle Kernels. I previously trained a U-Net model on a dataset using a GPU, and now I am trying to do the same on a TPU. I made a TFRecord out of the dataset images and masks, and the TFRecord returns an image and a mask. When I train on the TPU, the loss is always NaN, even though the accuracy metric looks normal. Since this is the same model and loss I used on the GPU, I am guessing the problem is in the TFRecord or in the dataset loading.
The code for loading the data is given below:
def decode_image(image_data):
    image = tf.image.decode_jpeg(image_data, channels=3)
    image = tf.cast(image, tf.float32) / (255.0)  # convert image to floats in [0, 1] range
    image = tf.reshape(image, [*IMAGE_SIZE, 3])   # explicit size needed for TPU
    return image

def decode_image_mask(image_data):
    image = tf.image.decode_jpeg(image_data, channels=3)
    image = tf.cast(image, tf.float64) / (255.0)  # convert image to floats in [0, 1] range
    image = tf.reshape(image, [*IMAGE_SIZE, 3])   # explicit size needed for TPU
    image = tf.image.rgb_to_grayscale(image)
    image = tf.math.round(image)
    return image

def read_tfrecord(example):
    TFREC_FORMAT = {
        "image": tf.io.FixedLenFeature([], tf.string),  # tf.string means bytestring
        "mask": tf.io.FixedLenFeature([], tf.string),   # shape [] means single element
    }
    example = tf.io.parse_single_example(example, TFREC_FORMAT)
    image = decode_image(example['image'])
    mask = decode_image_mask(example['mask'])
    return image, mask

def load_dataset(filenames, ordered=False):
    # Read from TFRecords. For optimal performance, reading from multiple files at once and
    # disregarding data order. Order does not matter since we will be shuffling the data anyway.
    ignore_order = tf.data.Options()
    if not ordered:
        ignore_order.experimental_deterministic = False  # disable order, increase speed
    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTO)  # automatically interleaves reads from multiple files
    dataset = dataset.with_options(ignore_order)  # uses data as soon as it streams in, rather than in its original order
    dataset = dataset.map(read_tfrecord, num_parallel_calls=AUTO)
    return dataset

def get_training_dataset():
    dataset = load_dataset(TRAINING_FILENAMES)
    dataset = dataset.repeat()  # the training dataset must repeat for several epochs
    dataset = dataset.shuffle(2048)
    dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
    dataset = dataset.prefetch(AUTO)  # prefetch next batch while training (autotune prefetch buffer size)
    return dataset

def get_validation_dataset(ordered=False):
    dataset = load_dataset(VALIDATION_FILENAMES, ordered=ordered)
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.cache()
    dataset = dataset.prefetch(AUTO)  # prefetch next batch while training (autotune prefetch buffer size)
    return dataset

def count_data_items(filenames):
    # the number of data items is written in the name of the .tfrec files, i.e. flowers00-230.tfrec = 230 data items
    n = [int(re.compile(r"-([0-9]*)\.").search(filename).group(1)) for filename in filenames]
    return np.sum(n)
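For context, this is roughly how I consume these loaders during training (model and the epoch count are placeholders here, not the exact kernel code):
# Sketch only: `model`, TRAINING_FILENAMES, VALIDATION_FILENAMES, BATCH_SIZE and AUTO
# are assumed to be defined elsewhere in the kernel.
steps_per_epoch = count_data_items(TRAINING_FILENAMES) // BATCH_SIZE

history = model.fit(
    get_training_dataset(),                  # repeats forever, so steps_per_epoch is required
    steps_per_epoch=steps_per_epoch,
    validation_data=get_validation_dataset(),
    epochs=20,
)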
So, what am I doing wrong?
Turns out the problem was that I was unbatching the data and re-batching it to 20 so I could properly view the images and masks in matplotlib, and this was messing up how the data was being sent to the model, hence the NaN loss. Making another copy of the dataset for viewing the images, while sending the original one to training, solved the problem.
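Roughly, the fix looks like this (a sketch, reusing the get_training_dataset() from the question):
# Dataset that actually goes to model.fit(): left untouched
train_ds = get_training_dataset()

# Separate copy, re-batched only for visualization in matplotlib
vis_ds = get_training_dataset().unbatch().batch(20)

for images, masks in vis_ds.take(1):
    print(images.shape, masks.shape)  # e.g. (20, H, W, 3) and (20, H, W, 1)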

Can the convolution layers of the encoder and decoder be different in a convolutional auto-encoder?

For example, the encoder and decoder have different filter sizes and numbers of feature maps, the number of convolutional layers also differs, and there are more hidden units than input units; the specific code is as follows. I don't know whether this still counts as a convolutional auto-encoder, or whether the decoder has to mirror the encoder exactly. I hope someone can help me answer this question. Thank you very much.
import keras
from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D

input_data = Input(shape=(1, 128, 3))
x = Conv2D(6, (1, 1), padding='same')(input_data)
new_input_data = keras.layers.concatenate([input_data, x], axis=-1)
x = Conv2D(40, (1, 6), activation='relu', padding='same')(new_input_data)
encoded = MaxPooling2D((1, 2), padding='same')(x)
x = Conv2D(40, (1, 6), activation='relu', padding='same')(encoded)
x = UpSampling2D((1, 2))(x)
decoded = Conv2D(3, (1, 6), activation='relu', padding='same')(x)
The channels change as 3 → 6 → 9 → 40 on the way in, and 40 → 3 on the way out.
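If it helps to see the shapes, the model can be built and summarized like this (a sketch assuming the functional-API tensors defined above, not part of my original code):
from keras.models import Model

# Build the auto-encoder from the input/output tensors above and print
# the per-layer output shapes to check the channel progression.
autoencoder = Model(input_data, decoded)
autoencoder.summary()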

Image captioning example: input size of the decoder LSTM in PyTorch

I'm new to PyTorch, and I have a doubt about the image captioning example code. In the DecoderRNN class the LSTM is defined as:
self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
In the forward function:
embeddings = self.embed(captions)
embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)
We first embed the captions and then concatenate the embeddings with the context feature from the EncoderCNN, but the concatenation increases the size beyond embed_size, so how can we forward that to the LSTM, when the input size of the LSTM is already defined as embed_size?
Am I missing something here? Thanks in advance.
You can analyze the shapes of all the input and output tensors, and then it will become easier to understand what changes you need to make.
Let's say: captions = B x S where S = sentence (caption) length.
embeddings = self.embed(captions)
Now, embeddings = B x S x E where E = embed_size.
embeddings = torch.cat((features.unsqueeze(1), embeddings), 1)
Here, embeddings = B x (S + 1) x E.
My understanding is that you are doing this wrong. I guess you should concatenate the features along axis=2, because you probably want to concatenate the image features with the word embedding of each word in the caption. So, if you do:
embeddings = torch.cat((features.unsqueeze(1), embeddings), 2)
it results in embeddings = B x S x (E + F), where E + F = embed_size + img_feat_size.
Then you need to revise your LSTM definition as follows.
self.lstm = nn.LSTM(embed_size + img_feat_size, hidden_size, num_layers, batch_first=True)
In my experience, people usually concatenate image features with word features and pass them to the LSTM layer.
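Note that torch.cat does not broadcast, so for the axis=2 concatenation the image feature has to be repeated across the sequence dimension first; here is a minimal sketch with made-up sizes:
import torch

B, S, E, F = 4, 12, 256, 512            # made-up batch, seq len, embed and image-feature sizes
embeddings = torch.randn(B, S, E)       # word embeddings
features = torch.randn(B, F)            # one image feature vector per example

# Repeat the image feature for every word position, then concatenate per time step
features_rep = features.unsqueeze(1).expand(-1, S, -1)      # B x S x F
lstm_input = torch.cat((features_rep, embeddings), dim=2)   # B x S x (F + E)

lstm = torch.nn.LSTM(E + F, hidden_size=512, num_layers=1, batch_first=True)
output, (h, c) = lstm(lstm_input)
print(output.shape)  # torch.Size([4, 12, 512])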