How do I add a caffe slice layer to crop images? - deep-learning

For example instead of inferring a batch of 64 28x28 images and adding 64 results together why can I add a layer to the network and crop out these 64 images from a 224x224 input image? It seems this would be more elegant and faster.
gif of different lighting
How do you do this? I find it odd I can’t find slice examples like this and I am guess I must be using the wrong terms or someone asking the question wrong.
I tried the slice layer but it keep wanting to slice the 8 bit gray. For example to create four 224x224 2bit images.
Any Ideas?
By the way my application is really cool! I am doing unsupervised grouping of 3D objects using many different lighting angles. This eliminates manual labeling of classes!
https://github.com/GemHunt/lighting-augmentation
Thanks Much! Paul Krush

This was answered for now on the Caffe Google Group:
https://groups.google.com/forum/#!topic/caffe-users/_uii8kTMOdM
One or more of these answers will work for me:
1.) Python pre and post processing is better
2.) Check out windowing
3.) Try the reshape layer (1,1,224,224)->(1,64,28,28)

Related

U-net how to understand the cropped output

I'm looking for U-net implementation for landmark detection task, where the architecture is intended to be similar to the figure above. For reference please see this: An Attention-Guided Deep Regression Model for Landmark Detection in Cephalograms
From the figure, we can see the input dimension is 572x572 but the output dimension is 388x388. My question is, how do we visualize and correctly understand the cropped output? From what I know, we ideally expect the output size is the same as input size (which is 572x572) so we can apply the mask to the original image to carry out segmentation. However, from some tutorial like (this one), the author recreate the model from scratch then use "same padding" to overcome my question, but I would prefer not to use same padding to achieve same output size.
I couldn't use same padding because I choose to use pretrained ResNet34 as my encoder backbone, from PyTorch pretrained ResNet34 implementation they didn't use same padding on the encoder part, which means the result is exactly similar as what you see in the figure above (intermediate feature maps are cropped before being copied). If I would to continue building the decoder this way, the output will have smaller size compared to input image.
The question being, if I want to use the output segmentation maps, should I pad its outside until its dimension match the input, or I just resize the map? I'm worrying the first one will lost information about the boundary of image and also the latter will dilate the landmarks predictions. Is there a best practice about this?
The reason I must use a pretrained network is because my dataset is small (only 100 images), so I want to make sure the encoder can generate good enough feature maps from the experiences gained from ImageNet.
After some thinking and testing of my program, I found that PyTorch's pretrained ResNet34 didn't loose the size of image because of convolution, instead its implementation is indeed using same padding. An illustration is
Input(3,512,512)-> Layer1(64,128,128) -> Layer2(128,64,64) -> Layer3(256,32,32)
-> Layer4(512,16,16)
so we can use deconvolution (or ConvTranspose2d in PyTorch) to bring the dimension back to 128, then dilate the result 4 times bigger to get the segmentation mask (or landmarks heatmaps).

Locate/Extract Patches from an Image

I have an image(e.g. 60x60) with multiple items inside it. Items are in the shape of square boxes, with say 4x4 dimensions, and are randomly placed within the image. The boxes(items) themselves are created with random patterns, some random pixels switched on and others switched off. So, it could be the same box repeated twice(or more in case of more than 2 items) in the image or could be entirely different.
I'm looking to create a deep learning model that could take in the original image(60x60) and output all the patches in the image.
This is all I have for now, but I can definitely share more details as the discussion starts. I'd be interested to weigh in different options that can help me achieve this objective. Thanks.
I would solve this using object detection. First I would train a network to detect those box like objects by cutting out patches of those objects. Then I would run a Faster R-CNN or something like this on it.
You might want to take a look at the stanford lecture on detection (slides here: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf).

How to convert image of handwriting into pen coordinates?

I have a image with binary values (black and white) at each pixel. I want to convert this into an ordered list of pen coordinates (X,Y) which traces the path of the pen. I want to do this so that I can use an API which only takes pen coordinates as an input.
Is there any library or straightforward way to do this? Thanks!
This question https://graphicdesign.stackexchange.com/questions/25165/how-can-i-convert-a-jpg-signature-into-strokes describes using a vector graphic convertor to do this. It suggests first converting the pixels to binary values, and then using the autotrace tool.
autotrace -centerline -color-count 2 -output-file output.svg -output-format SVG input.png
I'm not sure what the best way to get SVG files into (X,Y) coordinates is, you can do a rough job by parsing the XML directly.
Another approach could be to look through the submissions in the Kaggle competition for this same problem: https://www.kaggle.com/c/icdar2013-stroke-recovery-from-offline-data . I haven't looked into these myself, though I imagine they'd result in better performance.

Drawing shapes versus rendering images?

I am using Pygame 1.9.2a with Python 2.7 for designing an experiment and have been so far using Pygame only on a need basis and am not familiar with all Pygame classes or concepts (Sprites, for instance, I have no idea about).
I am required to draw many (45 - 50 at one time) shapes on the screen at different locations to create a crowded display. The shapes vary from displaced Ts , displaced Ls to line intersections. [ Like _| or † or ‡ etc.]! I'm sorry that I am not able to post an image of this because I apparently do not have a reputation of 10, which is necessary to post images.
I also need these shapes in 8 different orientations. I was initially contemplating generating point lists and using these to draw lines. But, for a single shape, I will need four points and I need 50 of these shapes. Again, I'm not sure how to rotate these once drawn. Can I use the Pygame Transform or something? I think they can be used, say on Rects. Or will I have to generate points for the different orientations too, so that when drawn, they come out looking rotated, that is, in the desired orientation?
The alternative I was thinking of was to generate images for the shapes in GIMP or some software like that. But, for any screen, I will have to load around 50 images. Will I have to use Pygame Image and make 50 calls for something like this? Or is there an easier way to handle multiple images?
Also, which method would be a bigger hit to performance? Since, it is an experiment, I am worried about timing precision too. I don't know if there is a different way to generate shapes in Pygame. Please help me decide which of these two (or a different method) is better to use for my purposes.
Thank you!
It is easer to use pygame.draw.rect() or pygame.draw.polygon() (because you don't need to know how to use GIMP or InkScape :) ) but you have to draw it on another pygame.Surface() (to get bitmap) and than you can rotate it, add alpha (to make transparet) and than you can put it on screen.
You can create function to generate images (using Surface()) with all shapes in different orientations at program start. If you will need better looking images you can change function to load images created in GIMP.
Try every method on your own - this is the best method to check which one is good for you.
By the way: you can save generated images pygame.image.save() and then load it. You can have all elements on one image and use part of image Surface.get_clip()

Optical character recognition

Hey everyone,
I'm trying to create a program in Java that can read numbers of the screen, and also recognise images on the screen. I was wondering how i can achieve this?
The font of the numbers will always be the same. I have never programmed anything like this before, but my idea of how it works is to have the program take a screenshot, then overlay the image of the numbers with the section of the screenshot image and check if they match, repeating this for each numbers. If this is the correct way to do this, how would i put that in code.
Thanks in advance for any help.
You could always train a neural net to do it for you. They can get pretty accurate sometimes. If you use something like Matlab it actually has capabilities for that already. Apparently there's a neural network library for java (http://neuroph.sourceforge.net/) although I've never used it personally.
Here's a tutorial about using neuroph: http://www.certpal.com/blogs/2010/04/java-neural-networks-and-neuroph-a-tutorial/
You can use a neural network, support vector machine, or other machine learning construct for this. But it will not do the entire job. If you do a screen shot, you are going to be left with a very large image that you will need to find the individual characters on. You also need to deal with the fact that the camera might not be pointed straight at the text that you want to read. You will likely need to use a series of algorithms to lock onto the right parts of the image and then downsample it in a way that size becomes neutral.
Here is a simple Java applet I wrote that does some of this.
http://www.heatonresearch.com/articles/42/page1.html
It lets you draw on a relatively large area and locks in on your char. Then it recognizes it. I am using the alphabet, but digits should be easier. The complete Java source code is included.
One simpler approach could be to use template matching. If the fonts are same, and/or the size (in pixels)is known, then simple template matching can do the job for you. ifsize of input is unknown, you might have to create copies of images at different scales and do the matching at each scale.
One with the extreme value(highest or lowest depending on the method you follow for template matching) is your result.
Follow this link for details