I have followed the tutorial from this website https://colab.research.google.com/drive/16jcaJoc6bCFAQ96jDe2HwtXj7BMD_-m5#scrollTo=U5LhISJqWXgM.
The question is: the images contained 'person' instances and many others, so why weren't they detected and segmented?
import random
import cv2
from google.colab.patches import cv2_imshow
from detectron2.utils.visualizer import Visualizer, ColorMode

# get_balloon_dicts, balloon_metadata and predictor are defined earlier in the tutorial
dataset_dicts = get_balloon_dicts("balloon/val")
for d in random.sample(dataset_dicts, 3):
    im = cv2.imread(<CUSTOM_IMAGE_CONTAINING_PERSON_DETECTED_WHEN_CUSTOM_WASNT_USED>)  # <-- changed
    outputs = predictor(im)
    v = Visualizer(im[:, :, ::-1],
                   metadata=balloon_metadata,
                   scale=0.8,
                   instance_mode=ColorMode.IMAGE_BW)  # remove the colors of unsegmented pixels
    v = v.draw_instance_predictions(outputs["instances"].to("cpu"))
    cv2_imshow(v.get_image()[:, :, ::-1])
After downloading a custom photo containing people, I ran inference on the image as instructed under "Run a pre-trained detectron2 model", using the model they used (model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")), and the people were detected. But when I use the model trained on the balloons, it does not detect the people in the image (the threshold is not the problem, since I used 0.5 in both cases). Why is that, and how can I make it show all the instances? Help would be greatly appreciated :D
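For reference, here is roughly how I set up the two predictors (just a sketch following the tutorial; the exact config lines in my notebook may differ slightly):

import os
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

# Pre-trained COCO model (knows 'person' and the other 80 COCO classes)
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
predictor_coco = DefaultPredictor(cfg)

# Balloon model fine-tuned in the tutorial (trained with only 1 class: balloon)
cfg.MODEL.WEIGHTS = os.path.join(cfg.OUTPUT_DIR, "model_final.pth")
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
predictor_balloon = DefaultPredictor(cfg)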
With the help of the Internet, I solved the problem of recognizing handwritten digits. The model gave correct answers and had an accuracy of ~97.5%. But I wanted to test it on my own data, and in that case it was always wrong. First I gave it photos of digits written on paper (using OpenCV, I scaled them and converted them to grayscale). After getting an unsatisfactory result, I started feeding it digits drawn in Paint, but in the end the result was still unsatisfactory.
Photo preprocessing:
image = cv2.imread("22.png")
gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray_image = cv2.resize(gray_image, (28, 28), interpolation = cv2.INTER_AREA)
print(gray_image.shape)
cv2_imshow(gray_image)
Code to run the NN:
import numpy as np

x = np.expand_dims(gray_image, axis=0)  # add a batch dimension: (1, 28, 28)
res = model.predict(x)
print(res)
print(np.argmax(res))
I also attach the data that I gave to the NN:
here, here, and here.
According to the NN, all of these are a 5.
I tried training the neural network better, changing the data, and changing the code, but nothing helped.
There is a place in my code where I take an image from the MNIST dataset and look at what the NN sees in that image. I tried to take the same code and apply it to my own data, but it didn't work.
This is the place:
import matplotlib.pyplot as plt

n = 36
x = np.expand_dims(x_test[n], axis=0)  # take one MNIST test image and add a batch dimension
res = model.predict(x)
print(res)
print(np.argmax(res))
plt.imshow(x_test[n], cmap=plt.cm.binary)
plt.show()
Please tell me what to do so that the NN can correctly recognize the digits in my photos. Thanks.
Whenever you perform machine learning or any form of prediction, you have to make sure that your model is trained on data that is similar to the data you want to perform predictions on. Here I assume that you trained on data with a black background and white text, like the MNIST set.
Therefore, you should invert your input data so that it is similar to the data you trained the model on.
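A minimal sketch of that inversion, reusing the preprocessing from the question and assuming the model was trained on MNIST-style input (white digits on a black background, pixel values scaled to [0, 1]):

import cv2
import numpy as np

image = cv2.imread("22.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.resize(gray, (28, 28), interpolation=cv2.INTER_AREA)

inverted = 255 - gray                  # black background, white digit (MNIST-style)
x = inverted.astype("float32") / 255   # scale to [0, 1], as is typical for MNIST models
x = np.expand_dims(x, axis=0)          # batch dimension -> (1, 28, 28)

res = model.predict(x)
print(np.argmax(res))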
How can I generate attention maps for 3D grayscale MRI data after training a vision transformer for a classification problem?
My data shape is (120, 120, 120) and the model is a 3D ViT. For example:
import nibabel as nib
import torch

img = nib.load(...)                      # path omitted
img = torch.from_numpy(img.get_fdata())  # nib.load returns a NIfTI image; take the voxel array
model = torch.load(...)                  # path omitted
model.eval()
output, attn = model(img)
After this, because I have 6 transformer layers and 12 heads, the attn I get has shape
(6, 12, 65, 65)
Then I don't know how to apply this to the original 3D grayscale images. The examples I found online only deal with images from ImageNet.
For example:
https://github.com/tczhangzhi/VisionTransformer-Pytorch
https://github.com/jacobgil/vit-explain
Can anyone help me with this?
I would guess your ViT splits your volume into a 4x4x4 grid of tokens and adds a single cls token, so 65 tokens per volume overall.
If you want to see how the cls token attends to all other 64 tokens for a specific layer and a specific head, you can:
import matplotlib.pyplot as plt

layer = 4  # inspect the 5th layer (0-indexed)
head = 7   # inspect the 8th head (0-indexed)

# attention of the cls token (index 0) to the 64 patch tokens, reshaped to the 4x4x4 token grid
cls_attn = attn[layer, head, 0, 1:].detach().cpu().reshape(4, 4, 4)

fig, ax = plt.subplots(2, 2)
for z in range(cls_attn.shape[0]):
    ax.flat[z].matshow(cls_attn[z, ...])  # one 4x4 slice of the token grid per subplot
plt.show()
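If you then want to overlay this on the original volume rather than on the 4x4x4 token grid, one option (just a sketch, assuming each token covers a 30x30x30 patch of the 120x120x120 volume) is to upsample the attention map back to the input resolution with trilinear interpolation:

import torch
import torch.nn.functional as F

# cls_attn from above, shape (4, 4, 4)
attn_map = torch.as_tensor(cls_attn, dtype=torch.float32)[None, None]  # (1, 1, 4, 4, 4)
attn_map = F.interpolate(attn_map, size=(120, 120, 120),
                         mode="trilinear", align_corners=False)
attn_map = attn_map[0, 0]  # (120, 120, 120), aligned with the input volume

# e.g. blend one axial slice with the MRI data:
# plt.imshow(volume[:, :, 60], cmap="gray")
# plt.imshow(attn_map[:, :, 60], cmap="jet", alpha=0.4)
# plt.show()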
If I enlarge my dataset using augmentations, will I get a better result?
For example, I have 1 class, the dog class, and 4 images for it. I applied augmentations to the 4 images. Now some of these images are augmented and some are not, but I still have 4 images.
Would it be more efficient to add the original images to the augmented ones, giving 8 images in the dataset? I tried to do this by changing my custom Dataset, but if I have a lot of images (100,000), Colab tells me bye-bye because it runs out of memory.
Does it matter whether I apply augmentations before creating the dataset or after creating the dataset, in the training loop, like this:
for x, y in train_loader:
    aug_x = aug(x)
    ...
    output = model(aug_x)
    loss = ...
    loss.backward()
    ...
I suppose I need to choose one way to apply augmentations to my images, either before building the dataset or in the training loop. Am I wrong? Please write your suggestions with code below. Thank you!
Appropriately chosen augmentations usually lead to better results.
You are right that pre-generating the augmented dataset and saving the augmented images consumes all the memory in the case of big datasets.
So it makes sense to apply augmentations dynamically, on-the-fly.
A simple PyTorch example:
import cv2
import numpy as np
from torch.utils.data import DataLoader, Dataset


class MyDataset(Dataset):
    def __init__(self, image_paths, size):
        self._image_paths = image_paths
        self._size = size

    def __getitem__(self, idx):
        # Cycle through the available paths so the dataset can be "larger" than the image list
        path = self._image_paths[idx % len(self._image_paths)]
        image = cv2.imread(path)
        # Insert your augmentations here
        if np.random.rand() < 0.5:
            image = cv2.flip(image, 0)  # random vertical flip
        if np.random.rand() < 0.5:
            image = cv2.flip(image, 1)  # random horizontal flip
        return image

    def __len__(self):
        return self._size


image_paths = ["1.png"]
loader = DataLoader(MyDataset(image_paths, 10), batch_size=4)
for batch in loader:
    batch_images = np.hstack([image.numpy() for image in batch])
    cv2.imshow("image", batch_images)
    cv2.waitKey()
One special case where this approach works poorly is when the augmentation process takes a lot of time, for example when you need to render some 3D objects using a complex pipeline with Blender. Such augmentations become the bottleneck during training, so it makes sense to save the augmented data to disk first and then use it to enlarge the dataset during training.
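A rough sketch of that offline variant (just an illustration; the flips stand in for whatever expensive augmentation you actually use, and the file paths are hypothetical):

import os
import cv2
import numpy as np

os.makedirs("augmented", exist_ok=True)
for path in ["1.png"]:  # your original images
    image = cv2.imread(path)
    name, ext = os.path.splitext(os.path.basename(path))
    for i in range(4):  # save several augmented copies per original
        aug = image.copy()
        if np.random.rand() < 0.5:
            aug = cv2.flip(aug, 0)
        if np.random.rand() < 0.5:
            aug = cv2.flip(aug, 1)
        cv2.imwrite(os.path.join("augmented", f"{name}_{i}{ext}"), aug)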
The choice of augmentations heavily depends on the domain of your data. Augmentations that are too mild may lead to little or no accuracy gain, while very heavy augmentations may distort the training distribution too much, which degrades quality.
If you are interested in image augmentations you can check out these projects:
https://github.com/aleju/imgaug
https://github.com/albumentations-team/albumentations
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
I'm studying deep learning (supervised learning) to estimate depth images from monocular images.
The dataset currently uses KITTI data: the RGB input images come from the KITTI raw data, and the ground truth comes from the following link.
While training a model built as a simple encoder-decoder network, the results were not very good, so I am trying various things.
While searching for different methods, I found that, because the ground truth contains many invalid areas (values that cannot be used, as shown in the image below), people usually train only on the valid areas by masking them.
So I trained with masking, but I am curious why this result keeps coming out.
This is the training part of my code.
How can I fix this problem?
for epoch in range(num_epoch):
    model.train()  ### train ###
    for batch_idx, samples in enumerate(tqdm(train_loader)):
        x_train = samples['RGB'].to(device)
        y_train = samples['groundtruth'].to(device)

        pred_depth = model(x_train)

        valid_mask = y_train != 0  #### Here is the masking
        valid_gt_depth = y_train[valid_mask]
        valid_pred_depth = pred_depth[valid_mask]

        loss = loss_RMSE(valid_pred_depth, valid_gt_depth)
As far as I can understand, you are trying to estimate depth from an RGB image as input. This is an ill-posed problem, since the same input image can project to multiple plausible depth maps. You would need to integrate certain techniques to estimate accurate depth from RGB images instead of simply taking an L1 or L2 loss between the predicted depth and the ground-truth depth.
I would suggest you go through some papers on estimating depth from single images, such as Depth Map Prediction from a Single Image using a Multi-Scale Deep Network, where they use one network to first estimate the global structure of the given image and then a second network that refines the local scene information. Instead of taking a simple RMSE loss, as you did, they use a scale-invariant error function that measures the relationships between points.
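For reference, a minimal PyTorch sketch of that scale-invariant log error (my own translation of the formulation in the paper, with the commonly used lambda = 0.5; the eps clamp and the non-zero validity mask, which matches the mask in your code, are my additions):

import torch

def scale_invariant_loss(pred, target, lam=0.5, eps=1e-6):
    """Scale-invariant log error (Eigen et al., 2014), computed over valid pixels only."""
    valid = target > 0                                             # same masking of invalid ground truth as above
    d = torch.log(pred[valid].clamp(min=eps)) - torch.log(target[valid])
    return (d ** 2).mean() - lam * d.mean() ** 2

# usage inside the training loop above:
# loss = scale_invariant_loss(pred_depth, y_train)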
I'm currently trying to implement the CBOW model and managed to get the training and testing working, but I'm facing some confusion as to the "proper" way to finally extract the weights from the model to use as our word embeddings.
Model
class CBOW(nn.Module):
    def __init__(self, config, vocab):
        super().__init__()  # required before assigning submodules
        self.config = config  # Basic config file to hold arguments.
        self.vocab = vocab
        self.vocab_size = len(self.vocab.token2idx)
        self.window_size = self.config.window_size

        self.embed = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.config.embed_dim)
        self.linear = nn.Linear(in_features=self.config.embed_dim, out_features=self.vocab_size)

    def forward(self, x):
        x = self.embed(x)
        x = torch.mean(x, dim=0)  # Average out the embedding values.
        x = self.linear(x)
        return x
Main process
After I run my model through a Solver with the training and testing data, I basically told the train and test functions to also return the model that's used. Then I assigned the embedding weights to a separate variable and used those as the word embeddings.
Training and testing was conducted using cross entropy loss, and each training and testing sample is of the form ([context words], target word).
def run(solver, config, vocabulary):
    for epoch in range(config.num_epochs):
        loss_train, model_train = solver.train()
        loss_test, model_test = solver.test()

    embeddings = model_train.embed.weight
I'm not sure if this is the correct way of going about extracting and using the embeddings. Is there usually another way to do this? Thanks in advance.
Yes, model_train.embed.weight will give you a torch tensor that stores the embedding weights. Note, however, that this tensor is a parameter that is still hooked into autograd; if you don't want/need that, model_train.embed.weight.data will give you the weights only.
A more generic option is to call model_train.embed.parameters(). This will give you a generator over all the weight tensors of the layer. In general a layer has multiple weight tensors, and weight gives you only one of them. Embedding happens to have only one, so here it doesn't matter which option you use.
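As a small usage sketch (the variable names follow the code in the question; "dog" is just a hypothetical token, and detaching to NumPy is one common way to freeze the vectors):

import numpy as np

# Detach the trained embedding matrix from the graph and copy it to NumPy
embeddings = model_train.embed.weight.detach().cpu().numpy()  # shape: (vocab_size, embed_dim)

# Look up the vector for a single word
idx = vocab.token2idx["dog"]   # "dog" is only an example token
word_vector = embeddings[idx]
print(word_vector.shape)       # (embed_dim,)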