I tried following the Caffe Siamese network MNIST example, along with many other Stack Overflow posts and threads here on Google Groups, but the information is always incomplete or a dead end. All I am trying to do is feed a Siamese network two RGB images to compute their similarity.
What I've done so far: I concatenated the two RGB images into one, converted the result to LevelDB, and edited the slicing layer in "mnist_siamese_train_test.prototxt" to "slice_point: 3". From what I understand, the remaining problem is with the channels. How do I fix this issue? I haven't found any useful resource that explains how to do this or that fits my case. Also, please let me know if there is an entirely different way of feeding the network directories and file lists instead of converting concatenated images to LevelDB. Let me know if anything needs further explanation.
You can find the answer in detail in this thread; in short, you have two options:
Use a Slice layer to slice the blob you created in the LMDB. As you pointed out in the question, with slice_point: 3 and a 6-channel "image" (3 channels per input image), it should split the blob into two images with 3 channels each (see the prototxt sketch at the end of this answer).
Use 2 different InputDataLayers, each one reading a different file; you can see a working example in the thread.
Now, as pointed out, you seem to be doing the right thing. Can you copy-paste your error and your .prototxt file here?
Also, check that the dimension along which you are slicing is correct.
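For reference, here is a minimal sketch of what the Slice layer could look like for your 6-channel case, adapted from the slice_pair layer in the MNIST Siamese example (the blob name pair_data is taken from that example; adjust it to whatever your data layer actually outputs):

layer {
  name: "slice_pair"
  type: "Slice"
  bottom: "pair_data"   # 6-channel blob read from your LevelDB/LMDB
  top: "data"           # channels 0-2: first RGB image
  top: "data_p"         # channels 3-5: second RGB image
  slice_param {
    slice_dim: 1        # slice along the channel axis of the (N, C, H, W) blob
    slice_point: 3      # split after the first 3 channels
  }
}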
I am working on a project for detecting abandoned luggage in train stations and airports. Is there a dataset that contains all kinds of bags and luggage? I have searched a lot but can't find any, and I would really appreciate it if someone could help me.
Thanks!
I was also looking for this kind of dataset and haven't found any.
So my solution was to build one from the images of the COCO dataset.
download the metadata and keep only the labelled images that contain one of these 3 classes: backpack, suitcase, handbag (below I'll refer to all of them as just "bag").
get rid of the pictures where the bag is carried by a person. To do this, write a script that considers the bounding boxes of all objects in the picture; if at least one bag in the picture has a bounding box intersecting the bounding box of a person object, that is a good indicator that the bag is in a person's hands or on their back/shoulder. The remaining bags may be considered abandoned (a sketch of such a script follows this list).
save the links to these images in a separate file and download them with curl.
Note that this approach still requires manual clean-up after the images are downloaded, but it's the best you can do in the absence of a ready-to-use dataset.
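A minimal sketch of the filtering step in Python, using pycocotools (the annotation filename and output path are placeholders; adjust them to the COCO release you download):

from pycocotools.coco import COCO

coco = COCO('annotations/instances_train2017.json')

bag_cat_ids = coco.getCatIds(catNms=['backpack', 'suitcase', 'handbag'])
person_cat_id = coco.getCatIds(catNms=['person'])[0]

def boxes_intersect(a, b):
    # COCO bounding boxes are [x, y, width, height]
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

# getImgIds with several catIds returns their intersection, so collect per category
img_ids = set()
for cat_id in bag_cat_ids:
    img_ids.update(coco.getImgIds(catIds=[cat_id]))

urls = []
for img_id in img_ids:
    anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
    bags = [a['bbox'] for a in anns if a['category_id'] in bag_cat_ids]
    people = [a['bbox'] for a in anns if a['category_id'] == person_cat_id]
    # keep the image only if no bag overlaps a person
    if not any(boxes_intersect(bag, p) for bag in bags for p in people):
        urls.append(coco.loadImgs(img_id)[0]['coco_url'])

with open('abandoned_bag_urls.txt', 'w') as f:
    f.write('\n'.join(urls))

The resulting file can then be fed to curl, e.g. xargs -n 1 curl -O < abandoned_bag_urls.txt.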
I want to train a Faster R-CNN network on my own images to detect faces. I have checked quite a few GitHub repositories, but this is the example training file I always find:
/data/imgs/img_001.jpg,837,346,981,456,cow
/data/imgs/img_002.jpg,215,312,279,391,cat
But I can't find an example of how to train with images containing a couple of objects. Should it be:
1) /data/imgs/img_001.jpg,837,346,981,456,cow,215,312,279,391,cow
or
2) /data/imgs/img_001.jpg,837,346,981,456,cow
/data/imgs/img_001.jpg,215,312,279,391,cow
?
I just could not help myself but quote Far Cry 3 here: "The definition of insanity is doing the same thing over and over and expecting different results."
(Note that this is purely meant in an entertaining context and not as an insult in any way; I would not take the time to answer your question if I didn't think it worthwhile.)
In your second example, you would feed the exact same input data but require the network to learn two different outcomes. And, as you already noted, it is not very common for these libraries to support multiple labels per image.
Oftentimes this is done purely for the sake of simplicity, since supporting multiple labels requires you to change your metrics to accommodate multiple outputs: instead of one-hot encoded targets, you could now have multiple "targets" per image (see the small illustration below).
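To illustrate the difference in target encoding (a toy NumPy example, not from the original answer):

import numpy as np

num_classes = 5
one_hot = np.zeros(num_classes)    # single label per image
one_hot[2] = 1                     # -> [0., 0., 1., 0., 0.]

multi_hot = np.zeros(num_classes)  # several labels per image
multi_hot[[1, 3]] = 1              # -> [0., 1., 0., 1., 0.]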
This is even more challenging for object detection (as opposed to the object classification described above), since you now also have to decide how to represent your targets.
If it is at all possible, I would personally restrict myself to labelling one class per image, or have a look at another library that does support multiple labels per image, since the effort of rewriting that much code is probably not worth the minute improvement in results.
I have been trying to solve this problem for weeks, but to no avail.
My problem is:
My deep learning model has the following setup:
INPUT: a sequence of images.
OUTPUT: what is happening in the sequence, i.e. categorize the activity as one of 10 possible activities.
I have two cameras recording the same activity from two views. How could I combine those two views to improve the accuracy?
I think you should use DELF features: extract features from both similar images and combine them.
How to combine the two views depends entirely on your understanding of the problem. Let me give you two different examples.
CASE I: when you review your training data, you can easily tell which camera is better for some of the data. For example, one camera may capture everything useful while the other may not, due to possible occlusions (note: I am not saying one camera is always better than the other). In this case, you may use a late-fusion technique to fuse the two resulting features representing the sequences from the two cameras (see the sketch below).
CASE II: it is difficult for you to tell which camera is better. This basically indicates that you may not see a big performance boost from considering both cameras, but perhaps some small improvement.
Finally, when you say two cameras, is it possible for you to do something like binocular stereo vision? In that case you could obtain extra depth information that is not included in any single camera, which may be helpful for the recognition task.
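A minimal late-fusion sketch in PyTorch (the feature dimension and the assumption that each view's sequence has already been encoded into a fixed-size vector are mine, not from the answer):

import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, feat_dim=512, num_classes=10):
        super().__init__()
        # fuse by concatenating the two per-view features, then classify
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, feat_cam1, feat_cam2):
        fused = torch.cat([feat_cam1, feat_cam2], dim=1)
        return self.classifier(fused)

model = LateFusionClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 512))  # batch of 4 sequences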
I used caffe/examples/cifar10 to train models for classification, and I want to use the result for visualization. But CIFAR-10 images are all 32x32, which is too small for per-unit visualization. Now I want to try another dataset: ImageNet.
But in my case, instead of a thousand classes, I want only ten classes, just like CIFAR-10. The data ImageNet provides is too big to download and then extract those ten classes from. Is there any way to use the full image URL list from the official ImageNet website and download only the selected 10 classes to my disk? I don't see any labels in the text file (it only contains the full image URLs).
If you poke around, I believe that you'll find a text file that lists each URL along with the label for that URL, an integer in the range 0-999 (a filtering sketch follows below).
However, I don't know of a site that maps ILSVRC classes to CIFAR classes. I poked around on the Internet for a while and came up with nothing; you may wind up having to examine all 1000 classes and create your own mapping.
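For the filtering itself, here is a hedged Python sketch that assumes the format of ImageNet's fall11_urls.txt list, where each line is "<image_id><TAB><url>" and the image ID begins with the synset WNID (the two WNIDs below are placeholders; substitute the ten synsets you actually want):

# keep only the URLs whose synset is in our 10-class subset
wanted_wnids = {'n02084071', 'n02121808'}  # ...plus the other eight WNIDs you choose

with open('fall11_urls.txt', encoding='latin-1') as src, \
     open('subset_urls.txt', 'w') as dst:
    for line in src:
        try:
            image_id, url = line.strip().split('\t', 1)
        except ValueError:
            continue  # skip malformed lines
        if image_id.split('_')[0] in wanted_wnids:
            dst.write(url + '\n')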
I have an aerial image and a satellite image, and I'm trying to measure the similarity of the two images and get a factor of how similar they are. What should I look into? I checked fine-grained image similarity, but the problem is that I don't know what the negative image would be, since my two images are specific. So what should I read or check out?
Check out Learning a Similarity Metric Discriminatively, with Application to Face Verification.
A tutorial on how to implement the Siamese network described in the paper is here.
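In case it helps, here is a minimal sketch of the contrastive loss that the paper introduces, in PyTorch (the margin value and the assumption that you already have an encoder producing the two embeddings are mine, not from the tutorial):

import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, label, margin=1.0):
    # label = 1 for similar pairs, 0 for dissimilar pairs
    d = F.pairwise_distance(emb1, emb2)
    return (label * d.pow(2) +
            (1 - label) * torch.clamp(margin - d, min=0).pow(2)).mean()

Trained this way, the distance between the two embeddings gives you the "how similar" factor you are after.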