Imagine you have a real image, and you put it through a 2D FFT.
Usually this yields a cross-like structure (edge effect) and some real content, depending on the image.
Imagine the original image contains two spots with bad lighting which, for example, produce two distinct frequencies in the Fourier domain.
Now imagine another, different image, also put through the FFT, containing a single spot with bad lighting; this spot produces the same frequencies in the Fourier domain as the two spots of the first image combined.
How would one distinguish those two amplitude spectra? In my opinion, there is no way of knowing the location of a certain frequency from the Fourier amplitude spectrum. Only direction information is retained, e.g. a horizontal line in the image will yield a vertical line in the Fourier amplitude spectrum.
So the information I am after must be hidden in the phase spectrum. How can I recover such a specific piece of information from a phase spectrum that looks like noise?
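To make the premise concrete, here is a minimal NumPy sketch of the effect I mean: circularly shifting the image content leaves the amplitude spectrum untouched and only changes the phase (the image here is just random placeholder data).

import numpy as np

# A random "image" and a circularly shifted copy of it; the shift stands in
# for moving a badly lit spot to a different location.
img = np.random.rand(128, 128)
shifted = np.roll(img, shift=(20, 35), axis=(0, 1))

F1, F2 = np.fft.fft2(img), np.fft.fft2(shifted)

# The amplitude spectra are identical, so location cannot be read off them.
print(np.allclose(np.abs(F1), np.abs(F2)))      # True
# The phase spectra differ by a linear ramp that encodes the shift.
print(np.allclose(np.angle(F1), np.angle(F2)))  # False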
I was reading this report and ran into the same problem it describes: I have trained a model with the YOLOv3 algorithm, and sometimes it predicts more than one bounding box over a single object. I am wondering what causes this issue. Thanks in advance for your replies. In the image here you can see that the top-right rope end is detected by two bounding boxes.
It predicts multiple boxes because it must predict something, with some probability, for every candidate. The algorithm simply runs its computations for each bounding box in each cell of the picture's grid and produces an output; there is no way to know which box is the true one. That is why there is an algorithm called Non-Max Suppression that can eliminate the redundant boxes, but it is not 100% accurate.
Here are two pictures, before and after applying the Non-Max Suppression algorithm.
The problem is that you eliminate a box only if its intersection with the main box (the box that has the highest probability) exceeds a certain threshold, and this threshold may not be enough to eliminate the box, as in the case of the girl in the picture.
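For reference, the core of the procedure is just a greedy IoU test against the highest-scoring box. A minimal NumPy sketch (the box format and the threshold value are assumptions, not tied to any particular YOLO implementation):

import numpy as np

def iou(a, b):
    # Boxes are [x1, y1, x2, y2]; returns intersection over union.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # Greedily keep the highest-scoring box and drop every remaining box that
    # overlaps it by more than iou_threshold; repeat with what is left.
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep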
Suppose I have a CNN trained to classify images of different animals. The output of such a model is a point (the output point) in an n-dimensional space, where n is the number of animal classes the model was trained on; that output is then converted into a one-hot vector with n entries, giving the label the CNN assigns to the image. But let's stick with the n-dimensional point, which is the CNN's concept of the input image.
Suppose then that I want to take that point and transform it so that the final output is an image of constrained width and height (the dimensions should be the same across different input images) which produces the same point as the original input image did. How do I do that?
I'm basically asking about the methods (mostly the training) used for this kind of task, where an image must be reconstructed from the output point of the CNN. I know the image will never be identical; I'm looking for images that produce the same (or at least a not very different) output point as the input image when fed to the CNN. Keep in mind that the input of the model I'm asking about has dimension n and the output is a two-dimensional (or three-dimensional, if not grayscale) tensor. I noticed that DeepDream does exactly this kind of thing (I think), but every time I put "deepdream" and "generate" into Google, an online generator comes up rather than the actual techniques; so if there are answers to this I'd love to hear them.
The output label does not contain enough information to reconstruct an entire image.
Quoting from the DeepDream example ipython notebook:
Making the "dream" images is very simple. Essentially it is just a gradient ascent process that tries to maximize the L2 norm of activations of a particular DNN layer.
So the algorithm modifies an existing image such that outputs of certain nodes in the network (can be in an intermediate layer, not necessarily the output nodes) become large. In order to do that, it has to calculate the gradient of the node output with respect to the input pixels.
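For concreteness, here is a minimal TensorFlow 2 sketch of one such gradient-ascent step; model is assumed to map the input image straight to the activations of the layer you want to amplify, and the step size is an arbitrary placeholder:

import tensorflow as tf

def dream_step(model, image, step_size=0.01):
    # Nudge the pixels so that the chosen layer's activations (their squared
    # L2 norm) grow larger: gradient ascent on the input, not on the weights.
    with tf.GradientTape() as tape:
        tape.watch(image)
        activations = model(image)
        loss = tf.reduce_sum(tf.square(activations))
    grad = tape.gradient(loss, image)
    grad /= tf.math.reduce_std(grad) + 1e-8   # normalize the gradient
    return image + step_size * grad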
I am trying to train a model that classifies images.
The problem I have is that the images have different sizes. How should I format my images, or adapt my model architecture?
You didn't say what architecture you're talking about. Since you said you want to classify images, I'm assuming it's a partly convolutional, partly fully connected network like AlexNet, GoogLeNet, etc. In general, the answer to your question depends on the network type you are working with.
If, for example, your network only contains convolutional units - that is to say, does not contain fully connected layers - it can be invariant to the input image's size. Such a network could process the input images and in turn return another image ("convolutional all the way"); you would have to make sure that the output matches what you expect, since you have to determine the loss in some way, of course.
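As a toy illustration (not a recommended architecture), a Keras model built only from convolutions accepts inputs of any spatial size, because none of its weights depend on height or width:

import tensorflow as tf

# Height and width are left as None, so the same weights work for any size.
inputs = tf.keras.Input(shape=(None, None, 3))
x = tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu')(inputs)
x = tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu')(x)
outputs = tf.keras.layers.Conv2D(1, 1, padding='same')(x)  # one output map
model = tf.keras.Model(inputs, outputs)

print(model(tf.random.uniform((1, 64, 80, 3))).shape)    # (1, 64, 80, 1)
print(model(tf.random.uniform((1, 200, 120, 3))).shape)  # (1, 200, 120, 1)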
If you are using fully connected units, though, you're in for trouble: here you have a fixed number of learned weights your network has to work with, so varying input sizes would require a varying number of weights, and that's not possible.
If that is your problem, here are some things you can do:
Don't worry about squashing the images. The network might learn to make sense of the content anyway; do scale and perspective mean anything to the content in your case anyway?
Center-crop the images to a specific size. If you fear you're losing data, do multiple crops and use these to augment your input data, so that the original image will be split into N different images of correct size.
Pad the images with a solid color to a squared size, then resize.
Do a combination of these.
The padding option might introduce an additional error source to the network's prediction, as the network might (read: likely will) be biased to images that contain such a padded border.
If you need some ideas, have a look at the Images section of the TensorFlow documentation; there are pieces like resize_image_with_crop_or_pad that take care of the heavier work.
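For example, a minimal sketch of the crop-or-pad route (in recent TensorFlow versions the function is spelled tf.image.resize_with_crop_or_pad; the target size here is an arbitrary choice):

import tensorflow as tf

def to_fixed_size(image, target_height=224, target_width=224):
    # Center-crops images that are too large and zero-pads images that are too
    # small, so every image ends up with the same spatial dimensions.
    return tf.image.resize_with_crop_or_pad(image, target_height, target_width)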
As for simply not caring about squashing, here's a piece of the preprocessing pipeline of the famous Inception network:
# This resizing operation may distort the images because the aspect
# ratio is not respected. We select a resize method in a round robin
# fashion based on the thread number.
# Note that ResizeMethod contains 4 enumerated resizing methods.
# We select only 1 case for fast_mode bilinear.
num_resize_cases = 1 if fast_mode else 4
distorted_image = apply_with_random_selector(
    distorted_image,
    lambda x, method: tf.image.resize_images(x, [height, width], method=method),
    num_cases=num_resize_cases)
They're totally aware of it and do it anyway.
Depending on how far you want or need to go, there is actually a paper here called Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition that handles inputs of arbitrary sizes by processing them in a very special way.
Try making a spatial pyramid pooling layer and put it after your last convolution layer, so that the FC layers always get constant-dimensional vectors as input. During training, train on the entire dataset using one particular image size for an epoch; then for the next epoch, switch to a different image size and continue training.
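A minimal eager-mode TensorFlow sketch of such a layer, assuming features of shape [batch, height, width, channels] and a hand-picked set of pyramid levels (the function name is my own, and the feature map is assumed to be at least as large as the finest level):

import tensorflow as tf

def spatial_pyramid_pool(features, levels=(1, 2, 4)):
    # For each pyramid level n, divide the feature map into an n x n grid,
    # max-pool each cell, and concatenate everything into one fixed-length
    # vector of size channels * sum(n * n for n in levels).
    height = tf.shape(features)[1]
    width = tf.shape(features)[2]
    pooled = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                h0, h1 = (height * i) // n, (height * (i + 1)) // n
                w0, w1 = (width * j) // n, (width * (j + 1)) // n
                cell = features[:, h0:h1, w0:w1, :]
                pooled.append(tf.reduce_max(cell, axis=[1, 2]))  # [batch, channels]
    return tf.concat(pooled, axis=1)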
First time asking a question on the stack exchange, hopefully this is the right place.
I can't seem to develop a close enough approximation algorithm for my situation as I'm not exactly the best in terms of 3D math.
I have a 3D environment in which I can access the position and rotation of any object, including my camera, and I can run trace lines between any two points to get the distance from a point to a point of collision. I also have my camera's field of view. I do not, however, have any form of access to the world/view/projection matrices.
I also have a collection of 2D images that are basically screenshots of the 3D environment from the camera; each collection is taken from the same point and angle, and a typical set is shot at roughly a 60-degree angle down from the horizon.
I have been able to get to the point of using "registration point entities" placed in the 3D world that represent the corners of the 2D image. When a point is picked on the 2D image it is read as a coordinate in the range 0-1, which is then interpolated between the 3D positions of the registration points. This seems to work well, but only if the image is a perfect top-down view. When the camera is tilted and another dimension of perspective is introduced, the results become grossly inaccurate, as there is no compensation for this perspective.
I don't need to be able to calculate the height of a point, say a window on a skyscraper; at minimum I need the coordinate at the base of the image plane, i.e. if I extend a line out from a specified image-space point, I need the point where that line would intersect the ground if nothing were in the way.
All of the material I found about this says to just deproject the point using the world/view/projection matrices, which I find straightforward in itself, except that I don't have access to those matrices, only data I can collect at screenshot time; the other algorithms use complex maths I simply don't grasp yet.
One end goal of this would be to place markers in the 3D environment where a user clicks in the image, while not being able to run a simple deprojection from the user's view.
Any help would be appreciated, thanks.
Edit: Herp derp, while my implementation for doing so is a bit odd due to the limitations of my situation, the solution essentially boiled down to ananthonline's answer about simply recalculating the view/projection matrices.
Between position, rotation and FOV of the camera, could you not calculate the View/Projection matrices of the camera (songho.ca/opengl/gl_projectionmatrix.html) - thus allowing you to unproject known 3D points?
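If assembling the full matrices feels heavy, the same ingredients (position, rotation, FOV) also let you cast the picked ray directly and intersect it with the ground plane. A NumPy sketch under the assumptions spelled out in the comments; the function name, the z-up convention and the camera basis vectors are placeholders, not something the engine provides:

import numpy as np

def picked_point_on_ground(px, py, cam_pos, forward, right, up,
                           fov_y_deg, aspect, ground_z=0.0):
    # px, py: picked point on the image in [0, 1] x [0, 1], origin at top left.
    # cam_pos, forward, right, up: camera position and orthonormal basis in
    #   world space (derivable from the camera's position and rotation angles).
    # fov_y_deg: vertical field of view; aspect = image width / image height.
    # Returns the world point where the picked ray hits the plane z = ground_z.
    tan_half = np.tan(np.radians(fov_y_deg) / 2.0)
    x = (2.0 * px - 1.0) * tan_half * aspect
    y = (1.0 - 2.0 * py) * tan_half
    direction = forward + x * right + y * up
    direction = direction / np.linalg.norm(direction)
    if abs(direction[2]) < 1e-9:
        return None                              # ray parallel to the ground
    t = (ground_z - cam_pos[2]) / direction[2]
    if t < 0:
        return None                              # ground is behind the camera
    return cam_pos + t * direction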
I am trying to randomly generate a directed graph for the purpose of making a puzzle game similar to the ice sliding puzzles from pokemon.
This is essentially what I want to be able to randomly generate: http://bulbanews.bulbagarden.net/wiki/Crunching_the_numbers:_Graph_theory
I need to be able to limit the size of the graph in an x and y dimension. In the example in the link, it would be restricted to an 8x4 grid.
The problem I am running into is not randomly generating the graph, but randomly generating a graph that I can properly map out in 2D space, since I need something (like a rock) on the far side of a node to make it visually make sense when you stop sliding. The trouble is that sometimes the rock ends up in the path between two other nodes, or possibly on another node itself, which breaks the entire graph.
After discussing the problem with a few people I know, we came to a couple of ideas that may lead to a solution: include the obstacles in the grid as part of the graph when constructing it, or start out with a fully filled grid, draw a random path, and delete the blocks needed to make that path work; the problem then becomes figuring out which blocks to delete so that you don't accidentally introduce an additional, shorter path. We were also thinking a dynamic programming algorithm might be beneficial, though none of us are very skilled at creating dynamic programming algorithms from scratch. Any ideas, or references to what this problem is officially called (if it is a recognized graph problem), would be most helpful.
I wouldn't look at it as a graph problem, since, as you say, the graph representation is incomplete. To generate a puzzle I would work directly on the grid, and work backwards: first fix the destination spot, then place rocks so that it can be reached from one or more spots, and iteratively add rocks to make those other spots reachable in turn, with the constraint that you never add a rock which breaks all the paths to the destination.
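A rough Python sketch of the building blocks of that approach, i.e. the slide move itself, a slide-based reachability check, and a rock-placement step that rejects any rock which disconnects the start from the destination (the grid representation and function names are just assumptions):

from collections import deque
import random

def slide(grid, pos, direction):
    # Slide from pos in the given direction until the next cell is a rock ('#')
    # or the edge of the grid; return the cell where the slide stops.
    rows, cols = len(grid), len(grid[0])
    (r, c), (dr, dc) = pos, direction
    while 0 <= r + dr < rows and 0 <= c + dc < cols and grid[r + dr][c + dc] != '#':
        r, c = r + dr, c + dc
    return (r, c)

def reachable(grid, start, goal):
    # BFS over slide moves: each move goes as far as possible in one direction.
    seen, queue = {start}, deque([start])
    while queue:
        pos = queue.popleft()
        if pos == goal:
            return True
        for direction in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = slide(grid, pos, direction)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def add_random_rock(grid, start, goal, tries=100):
    # grid is a list of lists with '.' for ice and '#' for rocks. Try a random
    # empty cell; keep the rock only if the goal stays reachable from the start.
    rows, cols = len(grid), len(grid[0])
    for _ in range(tries):
        r, c = random.randrange(rows), random.randrange(cols)
        if grid[r][c] == '.' and (r, c) not in (start, goal):
            grid[r][c] = '#'
            if reachable(grid, start, goal):
                return True
            grid[r][c] = '.'
    return False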
You might want to generate a planar graph, which means that the edges of the graph will not overlap each other in two-dimensional space. An equivalent characterization (Kuratowski's theorem) is that a planar graph contains no subgraph that is a subdivision of K_3,3 (the complete bipartite graph on six nodes) or K_5 (the complete graph on five nodes).
There's a paper on the fast generation of planar graphs.
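If you do go the planar-graph route, networkx can at least check planarity for you (a tiny sketch; the grid graph here is only a placeholder for whatever you generate):

import networkx as nx

G = nx.grid_2d_graph(4, 8)                    # stand-in for a generated graph
is_planar, embedding = nx.check_planarity(G)
print(is_planar)                              # True: grid graphs are planar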