From what I know, object detection does both localization and classification, but only for binary classification.
But what if I want to detect an object and its attributes?
For example, detect a face
if a mask is worn correctly, return 0; if a mask is worn improperly, 1; otherwise 2
if the person is eating, return 0; otherwise 1
if the person is smoking, return 0; otherwise 1
What kind of problem is this, and how can I annotate the data and train for it?
I tried using multiple bounding boxes, but I don't think that is a good idea.
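To make the structure I want concrete, here is a sketch of one possible annotation record per detected face (the field names and values are hypothetical):

annotation = {
    'bbox': [120, 80, 60, 60],   # face localization: x, y, width, height (hypothetical values)
    'mask': 2,                   # 0 = worn correctly, 1 = worn improperly, 2 = no mask
    'eating': 1,                 # 0 = eating, 1 = not eating
    'smoking': 1,                # 0 = smoking, 1 = not smoking
}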
I am generating summaries with a fine-tuned BART model, and I've noticed something strange. If I feed the labels to the model, it always generates summaries of the same length as the labels, whereas if I do not pass the labels, it generates outputs of length 1024 (BART's maximum sequence length). This is unexpected, so I'm trying to understand whether there is a problem/bug in the reproducible example below.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained('facebook/bart-large-cnn')
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-cnn')

sentence_to_summarize = ['This is a text to summarise. I just went for a walk in the park and saw very large crowds gathering to watch an impromptu football match']
encoded_dict = tokenizer.batch_encode_plus(sentence_to_summarize, return_tensors='pt', max_length=1024, padding='max_length')
input_ids = encoded_dict['input_ids']
attention_mask = encoded_dict['attention_mask']
label = tokenizer.encode('I went to the park', return_tensors='pt')
Notice the following two cases.
Case 1:
output = model(input_ids=input_ids, attention_mask=attention_mask)
print(output['logits'].shape)
shape printed is torch.Size([1, 1024, 50264])
Case 2:
output = model(input_ids=input_ids, attention_mask=attention_mask, labels=label)
print(output['logits'].shape)
shape printed is torch.Size([1, 7, 50264]) where 7 is the length of the label 'I went to the park' (including start and end tokens).
Ideally the summarization model would learn when to generate the EOS token, but this should not always lead to summaries of exactly the same length as the gold output (i.e. the label). Why is the label length influencing the model output in this way?
I would expect the only difference between cases 1 and 2 to be that in the second case the output also contains the loss value, but I wouldn't expect this to influence the logits in any way.
The original example does not use the labels parameter:
https://huggingface.co/docs/transformers/v4.22.1/en/model_doc/bart#transformers.BartForConditionalGeneration.forward.example
The labels parameter is optional and is not used for summarizing; it is there so the model can compute a training loss. When labels are passed (and no decoder_input_ids), they are shifted right and used as the decoder inputs (teacher forcing), which is why the logits have the same length as the label; without labels, the decoder inputs are derived from the padded input_ids, hence the length of 1024:
https://huggingface.co/docs/transformers/v4.22.1/en/model_doc/bart#transformers.BartForConditionalGeneration.forward.labels
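To actually produce summaries, use model.generate() rather than a raw forward pass. A minimal sketch reusing the tensors from the question (the beam and length settings are illustrative assumptions, not prescribed values):

summary_ids = model.generate(input_ids, attention_mask=attention_mask,
                             num_beams=4, max_length=60, early_stopping=True)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))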
This is my first foray into the world of object recognition. I have successfully trained a YOLO model on images that I found on Google and annotated myself in CVAT.
My questions are as follows.
a) How do I train the model to ignore some special variant that I am specifically NOT interested in detecting? Say I am getting false positives because something looks similar to one of my objects, and I want to train so that it is not detected. Does it simply work to include images containing the unwanted object in the training set, but not annotate the unwanted object?
b) If so, am I right in assuming that if I train on annotated images that have somehow missed occasional instances of desired objects, I am effectively telling the training engine that I'm not interested in those objects? In other words, is it therefore BAD if images don't have every single instance of the desired objects annotated?
c) If I happen to include an image in my training set with an empty annotation file, and there are desired objects in that image, does that effectively disincentivize the training engine from finding those in the future?
Thanks for any thoughts.
a) This is true. During training, the model treats the space inside bounding boxes as positive for a certain class, and the space outside the boxes as negative for that class.
b) See a, this is indeed the case.
c) Empty annotation files are used during training, but the model treats such an image as 'background', so it counts as negative examples too.
So, in short: annotate all instances of objects of a certain class, and maybe add 'background images' as negative examples to discourage false positives on look-alikes.
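To make the annotation convention concrete, here is a minimal sketch assuming the common darknet/Ultralytics YOLO label format (one .txt file per image; one line per box as 'class_id x_center y_center width height', all normalized to [0, 1]); the file names and coordinates are hypothetical:

from pathlib import Path

Path('labels').mkdir(exist_ok=True)

# image with two annotated objects of class 0
Path('labels/img_001.txt').write_text(
    '0 0.512 0.430 0.210 0.180\n'
    '0 0.150 0.700 0.090 0.120\n'
)

# background image (contains only the unwanted look-alike): empty label file
Path('labels/img_002.txt').write_text('')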
As I learn more and more about ML (I am a mobile dev), I'm starting to form an analogy in my head. I would like the community's opinion/validation.
As a front-end dev, you have a backend and an API that you can make requests to. The standard format for the API's inputs and outputs is JSON.
I'm running into a problem with ML models that I am trying to use: I don't know how to read the expected input (the "API"), and I don't know how to decode the expected output.
So far my experience has been fragmented, because some models say "Give me an image of [1,2,120,120]" or something like that.
To analogize: is there a unified way to define inputs and outputs for an ML model, like JSON unifies the inputs and outputs for a backend API?
If so, what are some rules one must follow to encode and decode data into this format?
Assuming this "ML model" is in the context of running an input through, say, a trained PyTorch model's forward pass to get an output, the unified way to define inputs and outputs for an ML model is through tensors. A tensor is essentially a multi-dimensional matrix containing elements of a single data type: think multi-dimensional lists with a single data type.
Tensors:MLModels::JSON:WebAPI
An Example using an Image Classifier
Model
Let's say your example model with the image is a classifier that takes in an image as input and outputs either dog or cat.
The input would usually be:
A tensor representation of an image with a shape like [1, 3, 120, 120], where 1 is the batch size, 3 is the number of colour channels (RGB), and 120x120 is the width and height of the image.
The output would usually be:
A normalized tensor of class probabilities like [0.7, 0.3], where index 0 represents the probability of the image depicting a dog and index 1 represents the probability that it's a cat.
Encoding and Decoding
Decoding the output to a string like "dog" or "cat" is obvious.
Encoding an image is slightly less obvious. At its heart, an image already has the form of a tensor: a multi-dimensional matrix containing a single data type. So it is still intuitive to encode an image stored as a JPEG or PNG into a tensor representation through the RGB channel dimensions and the pixel values for each channel. Typically, image files are loaded using libraries and methods like the Python Imaging Library (PIL) and PyTorch's torchvision.transforms.ToTensor().
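A minimal sketch of that encode/decode round trip (assumes PIL and torchvision are installed; 'cat.jpg' is a hypothetical input file, and the linear layer is just a stand-in for a real trained classifier):

import torch
from PIL import Image
from torchvision import transforms

img = Image.open('cat.jpg').convert('RGB').resize((120, 120))
x = transforms.ToTensor()(img).unsqueeze(0)   # encode: tensor of shape [1, 3, 120, 120]

# stand-in for a trained two-class model: anything mapping the image tensor to two logits
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 120 * 120, 2))
probs = torch.softmax(model(x), dim=1)        # e.g. tensor([[0.7, 0.3]])
print(['dog', 'cat'][probs.argmax(dim=1).item()])   # decode: tensor -> string label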
This example is very specific to a classifier-type model, but most supervised ML models will output a tensor like the above or a one-hot label. And ML models in general will have data inputs and outputs that can be represented as tensors.
I want to predict the trajectory of a falling ball. That trajectory is parabolic. I know that an LSTM may be overkill for this (i.e. a simpler method could suffice).
I thought we could do this with 2 LSTM layers and a Dense layer at the end.
The end result I want is to give the model 3 heights h0, h1, h2 and let it predict h3. Then I want to give it h1, h2, and the h3 it output previously to predict h4, and so on, until I can predict the whole trajectory.
Firstly, what would the input shape be for the first LSTM layer?
Would it be input_shape = (3, 1)?
Secondly, would the LSTM be able to predict a parabolic path?
I am getting an almost flat line, not a parabola, and I want to rule out the possibility that I am misunderstanding how to feed and shape the input.
Thank you
The input shape is in the form (samples, timeSteps, features).
Your only feature is "height", so features = 1.
And since you're going to input sequences of different lengths, you can use timeSteps = None.
So, your input_shape could be (None, 1).
Since we're going to use a stateful=True layer for prediction below, we can use batch_input_shape=(1, None, 1); the first dimension is the number of "samples" per batch, which you choose.
Your model can indeed predict the trajectory, but maybe it will need more than one layer. (The exact answer about how many layers and cells to use depends on knowing how the math inside an LSTM works.)
Training:
Now, first you need to train your network (only then will it be able to start predicting good things).
For training, suppose you have a sequence [h1, h2, h3, h4, h5, h6, ...] of true values in the correct order. (I suggest you actually use many sequences (samples), so your model learns better.)
For this sequence, you want an output predicting the next step, so your target would be [h2, h3, h4, h5, h6, h7, ...].
So, suppose you have a data array with shape (manySequences, steps, 1); you make:
x_train = data[:, :-1, :]   # all steps except the last
y_train = data[:, 1:, :]    # the same sequences shifted one step ahead
Now, your layers should use return_sequences=True (every input step produces an output step), and you train the model with this data.
At this point, whether you're using stateful=True or stateful=False is not very relevant. (But if True, you always need model.reset_states() before every single epoch and sequence.)
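Putting the training phase together, a minimal sketch (the layer sizes, optimizer, and random placeholder data are illustrative assumptions):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential([
    LSTM(32, return_sequences=True, input_shape=(None, 1)),
    LSTM(32, return_sequences=True),
    Dense(1),
])
model.compile(optimizer='adam', loss='mse')

data = np.random.rand(100, 50, 1)   # stand-in for (manySequences, steps, 1) of real heights
x_train = data[:, :-1, :]
y_train = data[:, 1:, :]
model.fit(x_train, y_train, epochs=10)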
Predicting:
For predicting, you can use stateful=True in the model. This means that when you input h1, it will produce h2. And when you input h2 it will remember the "current speed" (the state of the model) to predict the correct h3.
(In the training phase, it's not important to have this, because you're inputting the entire sequences at once. So the speed will be understood between steps of the long sequences).
You can see the method reset_states() as set_current_speed_to(0). You will use it whenever the step you're going to input is the first step of a sequence.
Then you can do loops like this:
model.reset_states()   # make speed = 0
nextH = someValueWithShape((1, 1, 1))
predictions = [nextH]
for i in range(steps):
    nextH = model.predict(nextH)
    predictions.append(nextH)
There is an example here, but using two features. A difference is that I use two models there, one for training and one for predicting, but you can use only one with return_sequences=True and stateful=True (don't forget to call reset_states() at the beginning of every epoch in training).
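For completeness, a sketch of what the stateful prediction model could look like, mirroring the hypothetical training model above (same illustrative sizes, with the trained weights copied over):

pred_model = Sequential([
    LSTM(32, return_sequences=True, stateful=True, batch_input_shape=(1, None, 1)),
    LSTM(32, return_sequences=True, stateful=True),
    Dense(1),
])
pred_model.set_weights(model.get_weights())   # reuse the trained weights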
I have pairs of movies, each described by 2783 features.
The vector is defined as: if the feature is present in the movie it's 1, otherwise it's 0.
Example :
movie 1 = [0,0,1,0,1,0,1 ...] & movie 2 = [1,0,1,1,1,0,1 ...]
Each pair has a label of 1 or 0:
movie1,movie2=0
movie1,movie4=1
movie2,movie150=0
The input is similar to that of the SGNS (skip-gram negative sampling) word2vec model.
My goal is to find the similarity between programs and learn an embedding of each movie.
I'd like to build a kind of 'SGNS implementation with Keras'. However, my input is not one-hot, so I can't use the Embedding layer. I tried to use Dense layers and merge them with a dot product. I'm not sure about the model architecture, and I got errors.
import keras
from keras.layers import Dense, Input, LSTM, Reshape
from keras.models import Model, Sequential

n_of_features = 2783
n_embed_dims = 20

# movie1 vectors
word = Sequential()
word.add(Dense(n_embed_dims, input_dim=n_of_features))

# movie2 vectors
context = Sequential()
context.add(Dense(n_embed_dims, input_dim=n_of_features))

model = Sequential()
model.add(keras.layers.dot([word, context], axes=1))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='mean_squared_error')

If someone has an idea of how to implement this, I'd appreciate it.
If you're not wedded to Keras, you could probably model this by turning each movie into a synthetic 'document' with tokens for each feature-that-is-present. Then, use a 'Paragraph Vectors' implementation in pure PV-DBOW mode to learn a vector for each movie.
(In pure PV-DBOW, dense doc-vectors are learned to predict each word in a document, without regard to order/word-adjacency/etc. It is a bit like skip-gram, but the training pairs are not 'word to every nearby word' but 'doc-token to every in-doc word'.)
In gensim, the Doc2Vec class with initialization parameter dm=0 uses pure PV-DBOW mode.
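A minimal sketch of that approach (gensim 4.x API; the feature tokens and movie tags are hypothetical placeholders, and the training parameters are illustrative):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# one synthetic 'document' per movie: a token for each feature that is present
docs = [
    TaggedDocument(words=['feat_2', 'feat_4', 'feat_6'], tags=['movie_1']),
    TaggedDocument(words=['feat_0', 'feat_2', 'feat_3', 'feat_4', 'feat_6'], tags=['movie_2']),
]

model = Doc2Vec(docs, dm=0, vector_size=20, min_count=1, epochs=50)   # dm=0 -> pure PV-DBOW

movie_vec = model.dv['movie_1']                      # learned embedding for a movie
print(model.dv.similarity('movie_1', 'movie_2'))     # cosine similarity between movies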