What does Nvidia mean when they say samples per pixel in regards to DLSS? - deep-learning

"NVIDIA researchers have successfully trained a neural network to find
these jagged edges and perform high-quality anti-aliasing by
determining the best color for each pixel, and then apply proper
colors to create smoother edges and improve image quality. This
technique is known as Deep Learning Super Sample (DLSS). DLSS is like
an “Ultra AA” mode-- it provides the highest quality anti-aliasing
with fewer artifacts than other types of anti-aliasing.
DLSS requires a training set of full resolution frames of the aliased
images that use one sample per pixel to act as a baseline for
training. Another full resolution set of frames with at least 64
samples per pixel acts as the reference that DLSS aims to achieve."
https://developer.nvidia.com/rtx/ngx
At first I thought of "sample" as it is used in graphics, the intersection of a channel and a pixel. But that really doesn't make any sense in this context: going from 1 channel to 64 channels?
So I am thinking it is "sample" as in the statistics term, but I don't understand how a static image could come up with 64 variations to compare against. Even going from FHD to 4K UHD is only 4 times the number of pixels. Trying to parse that second paragraph, I really can't make any sense of it.

16 bits × RGBA equals 64 samples per pixel maybe? They say at least, so higher accuracy could take as much as 32 bits × RGBA or 128 samples per pixel for doubles.
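For what it's worth, in renderer terminology "samples per pixel" normally refers to supersampling: the color is evaluated at several sub-pixel positions and averaged, so 1 spp means one evaluation per pixel (jagged) while 64 spp means 64 jittered evaluations averaged (smooth). A minimal Python sketch of that idea, using a hypothetical shade(x, y) function standing in for the renderer:

import random

def shade(x, y):
    # Hypothetical stand-in for a renderer: returns the scene color (grayscale
    # here) at continuous image coordinates (x, y). The "scene" is just a
    # diagonal edge, which aliases badly at one sample per pixel.
    return 1.0 if x + y < 21.0 else 0.0

def render_pixel(px, py, samples_per_pixel):
    # Average `samples_per_pixel` jittered sub-pixel samples for pixel (px, py).
    total = 0.0
    for _ in range(samples_per_pixel):
        total += shade(px + random.random(), py + random.random())
    return total / samples_per_pixel

aliased   = render_pixel(10, 10, samples_per_pixel=1)   # hard 0.0 or 1.0: jagged edge
reference = render_pixel(10, 10, samples_per_pixel=64)  # roughly 0.5: smooth edge coverage

Read this way, the quoted paragraph says the 1-sample-per-pixel frames are the cheap, aliased network inputs and the 64-plus-samples-per-pixel frames are the expensive, supersampled targets the network is trained to reproduce, which would suggest it is not about channels or bit depth.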

Related

How many neurons does the CNN input layer have?

In all the literature they say the input layer of a convnet is a tensor of shape (width, height, channels). I understand that a fully connected network has an input layer with the number of neurons equal to the number of pixels in an image (considering a grayscale image). So, my question is: how many neurons are in the input layer of a Convolutional Neural Network? The image below seems misleading (or I have understood it wrong). It says 3 neurons in the input layer. If so, what do these 3 neurons represent? Are they tensors? From my understanding of CNNs, shouldn't there be just one neuron of size (height, width, channel)? Please correct me if I am wrong.
It seems that you have misunderstood some of the terminology and are also confused by the fact that convolutional layers have 3 dimensions.
EDIT: I should make it clear that the input layer to a CNN is a convolutional layer.
The number of neurons in any layer is decided by the developer. For a fully connected layer, there is usually a neuron for each input. So, as you mention in your question, for an image the number of neurons in a fully connected input layer would likely be equal to the number of pixels (unless the developer wanted to downsample at this point or something). This also means that you could create a fully connected input layer that takes all pixels in each channel (width, height, channel), although each input is received by an input neuron only once, unlike in convolutional layers.
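As a tiny illustration of that point (the image size here is just an example, not taken from the question):

# One input neuron per pixel value in a fully connected input layer.
width, height, channels = 28, 28, 3          # illustrative RGB image size
n_input_neurons = width * height * channels  # 2352 input neurons
print(n_input_neurons)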
Convolutional layers work a little differently. Each neuron in a convolutional layer has what we call a local receptive field. This just means that the neuron is not connected to the entire input (this would be called fully connected) but just some section of the input (that must be spatially local). These input neurons provide abstractions of small sections of the input data that when taken together over the whole input we call a feature map.
An important feature of convolutional layers is that they are spatially invariant. This means that they look for the same features across the entire image. After all, you wouldn't want a neural network trained on object recognition to only recognise a bicycle if it is in the bottom left corner of the image! This is achieved by constraining all of the weights across the local receptive fields to be the same. Neurons in a convolutional layer that cover the entire input and look for one feature are called filters. These filters are 2 dimensional (they cover the entire image).
However, having the whole convolutional layer looking for just one feature (such as a corner) would massively limit the capacity of your network. So developers add a number of filters so that the layer can look for a number of features across the whole input. This collection of filters creates a 3 dimensional convolutional layer.
I hope that helped!
EDIT-
Using the example the OP gave to clear up loose ends:
OP's Question:
So imagine we have a (27 X 27) image. And let's say there are 3 filters, each of size (3 X 3). So there are totally 3 X 3 X 3 = 27 parameters (W's). So my question is how are these neurons connected? Each of the filters has to iterate over 27 pixels (neurons). So at a time, 9 input neurons are connected to one filter neuron. And these connections change as the filter iterates over all pixels.
Answer:
First, it is important to note that it is typical (and often important) for the receptive fields to overlap. For example, with a stride of 2, the 3x3 receptive field of the top-left neuron (neuron A) and the 3x3 receptive field of the neuron to its right (neuron B) would overlap: the leftmost 3 connections of neuron B could take the same inputs as the rightmost 3 connections of neuron A.
That being said, it seems that you would like to visualise this, so I will stick to your example where there is no overlap and assume that we do not want any padding around the image. Say there is an image of resolution 27x27 and we want 3 filters (this is our choice). Then each filter will have 81 neurons (a 9x9 2D grid of neurons), and each of these neurons will have 9 connections (corresponding to the 3x3 receptive field). Because there are 3 filters, and each has 81 neurons, we would have 243 neurons in total.
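A quick sanity check of those numbers in Python (stride 3 and no padding, i.e. the no-overlap case from the example):

image_size  = 27   # 27x27 input
kernel_size = 3    # 3x3 receptive field
stride      = 3    # no overlap
n_filters   = 3

out_size = (image_size - kernel_size) // stride + 1   # 9
neurons_per_filter = out_size * out_size              # 81 (a 9x9 feature map)
connections_per_neuron = kernel_size * kernel_size    # 9
total_neurons = n_filters * neurons_per_filter        # 243

print(out_size, neurons_per_filter, connections_per_neuron, total_neurons)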
I hope that clears things up. It is clear to me that you are confused with your terminology (layer, filter, neuron, parameter etc.). I would recommend that you read some blogs to better understand these things and then focus on CNNs. Good luck :)
First, let's clear up the image. The image doesn't say there are exactly 3 neurons in the input layer; it is only for visualisation purposes. The image shows the general architecture of the network, representing each layer with an arbitrary number of neurons.
Now, to understand CNNs, it is best to see how they will work on images.
Images are 2D objects, and in a computer are represented as 2D matrices, each cell having an intensity value for the pixel. An image can have multiple channels, for example, the traditional RGB channels for a colored image. So these different channels can be thought of as values for different dimensions of the image (in case of RGB these are color dimensions) for the same locations in the image.
On the other hand, neural layers are single dimensional: they take input at one end and give output at the other. So how do we process 2D images with 1D neural layers? This is where Convolutional Neural Networks (CNNs) come into play.
One can flatten a 2D image into a single 1D vector by concatenating successive rows in one channel, then successive channels. An image of size (width, height, channel) will become a 1D vector of size (width x height x channel) which will then be fed into the input layer of the CNN. So to answer your question, the input layer of a CNN has as many neurons as there are pixels in the image across all its channels.
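With NumPy, for example, that flattening is a single reshape (the image size here is just illustrative):

import numpy as np

width, height, channels = 32, 32, 3
image = np.random.rand(height, width, channels)   # a dummy RGB image

# "successive rows in one channel, then successive channels":
flat = image.transpose(2, 0, 1).reshape(-1)   # channel-major flattening
print(flat.shape)                             # (3072,) == 32 * 32 * 3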
I think you are confused about the basic concept of a neuron:
From my understanding of CNN shouldn't there be just one neuron of size (height, width, channel)?
Think of a neuron as a single computational unit, which can't handle more than one number at a time. So a single neuron can't handle all the pixels of an image at once. A neural layer made up of many neurons is equipped to deal with a whole image.
Hope this clears up some of your doubts. Please feel free to ask any queries in the comments. :)
Edit:
So imagine we have a (27 X 27) image. And let's say there are 3 filters, each of size (3 X 3). So there are totally 3 X 3 X 3 = 27 parameters (W's). So my question is how are these neurons connected? Each of the filters has to iterate over 27 pixels (neurons). So at a time, 9 input neurons are connected to one filter neuron. And these connections change as the filter iterates over all pixels.
Is my understanding right? I am just trying to visualize CNNs as neurons with the connections.
A simple way to visualise CNN filters is to imagine them as small windows that you are moving across the image. In your case you have 3 filters of size 3x3.
We generally use multiple filters so as to learn different kinds of features from the same local receptive field (as michael_question_answerer aptly puts it), or in simpler terms, our window. Each filter's weights are randomly initialised, so each filter learns a slightly different feature.
Now imagine each filter moving across the image, covering only a 3x3 grid at a time. We define a stride value which specifies how far the window shifts to the right and how far down. At each position, the filter weights and the image pixels under the window give a single new value in the new volume being created. So to answer your question, at any one instant a total of 3x3 = 9 pixels are connected to one neuron of a filter (through its 9 weights). The same goes for the other 2 filters.
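A minimal NumPy sketch of one such filter sliding over an image (a 27x27 image and a 3x3 filter as in the example, with stride 1 just to show the mechanics):

import numpy as np

image  = np.random.rand(27, 27)   # single-channel toy image
kernel = np.random.rand(3, 3)     # one 3x3 filter, randomly initialised
stride = 1

out_size = (image.shape[0] - kernel.shape[0]) // stride + 1   # 25 for stride 1
feature_map = np.zeros((out_size, out_size))

for i in range(out_size):
    for j in range(out_size):
        window = image[i*stride : i*stride + 3, j*stride : j*stride + 3]
        feature_map[i, j] = np.sum(window * kernel)   # 9 pixels meet 9 weights

print(feature_map.shape)   # (25, 25); with 3 filters you would stack 3 such maps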
Your approach to understanding CNNs by visualisation is correct, but you still need to brush up on your basic terminology. Here are a couple of nice resources that should help:
http://cs231n.github.io/convolutional-networks/
https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
Hope this helps. Keep up the curiosity :)

Why are image sizes used in CNN usually certain numbers?

I am pretty new to computer vision and deep learning. I always wonder why the dimensions of images fed into CNN models (or other models) are usually certain numbers like 28*28, 512*512, 256*256. Is there any reason for that? What will happen if I resize the images to an arbitrary size? Will the performance be affected?
Most CNN architectures use image sizes that contain multiple factors of 2. That way you can downsample the images using MaxPooling several times without having to round the resolution to the closest integer.
512 -maxpool-> 256 -maxpool-> 128 -maxpool-> 64 -maxpool-> 32 ...
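You can check how many times a given resolution halves cleanly with a couple of lines of Python:

def clean_halvings(size):
    # Count how many times `size` can be divided by 2 without a remainder.
    count = 0
    while size > 1 and size % 2 == 0:
        size //= 2
        count += 1
    return count

print(clean_halvings(512))   # 9: 512, 256, 128, 64, 32, 16, 8, 4, 2, 1
print(clean_halvings(572))   # 2: after 572 -> 286 -> 143 you would have to round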
Sometimes you will come across resolutions where this doesn't work. U-Net, for example, uses a resolution of 572*572, where you could apply MaxPooling only twice before having to round the resolution. This is because U-Net uses unpadded convolutions, which crop part of the image in the convolutional layers before MaxPooling is applied.
572 -conv-> 570 -conv-> 568 -maxpool-> 284 -conv-> 282 -conv-> 280 -maxpool-> 140 ...
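That size chain is easy to reproduce with a small sketch that applies unpadded 3x3 convolutions (each removes 2 pixels from each dimension) and 2x2 max pooling (each halves the size), as in the U-Net contracting path:

def unet_contracting_sizes(size, blocks=2):
    # Trace the spatial size through `blocks` of [conv, conv, maxpool],
    # assuming unpadded 3x3 convolutions and 2x2 max pooling.
    sizes = [size]
    for _ in range(blocks):
        size -= 2; sizes.append(size)   # unpadded 3x3 conv
        size -= 2; sizes.append(size)   # unpadded 3x3 conv
        size //= 2; sizes.append(size)  # 2x2 max pool
    return sizes

print(unet_contracting_sizes(572))   # [572, 570, 568, 284, 282, 280, 140]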
I'm not aware of any papers that evaluate the impact of rounding resolutions during MaxPooling, but my intuition is that it probably doesn't improve things. Personally, I have used rounding a few times when the input resolution was given, and didn't notice a difference compared to cropping parts of the images up front.
The input size is fixed when the network is designed, and training is also done with images of that size. So, if you want consistent results, you should resize your input images to that size and follow the same normalization rules that were used during training.
If you use an image of a different size then, depending on which layers are used in the network, you may get a size-mismatch exception or you will get a different output size.
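As a concrete sketch of that preprocessing (using Pillow and NumPy; the 224x224 target size and the mean/std values are placeholders, you would use whatever the network was actually trained with):

import numpy as np
from PIL import Image

def preprocess(path, size=(224, 224), mean=0.5, std=0.5):
    # Resize an image to the network's fixed input size and apply the same
    # normalization used during training (placeholder values here).
    img = Image.open(path).convert("RGB").resize(size, Image.BILINEAR)
    x = np.asarray(img, dtype=np.float32) / 255.0
    return (x - mean) / std   # shape: (224, 224, 3)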

Why does googlenet (inception) work well on the ImageNet dataset?

Some people say that the reason Inception works well on the ImageNet dataset is that the original images in the ImageNet dataset have different resolutions and are resized to the same size when they are used, so Inception, which can deal with different resolutions, is very well suited to ImageNet. Is this description true? Can anyone give a more detailed explanation? I am really confused about this. Thanks so much!
First of all, deep convolutional neural nets receive a fixed input image size (if by size you mean the number of pixels), so all images should be the same size, which means the same resolution. On the other hand, if the image resolution is high, with a lot of detail, the results of any network get better. ImageNet images are high-resolution images from Flickr, and resizing them needs little interpolation, so the resized images remain in good shape.
Second, the Inception module's main goal is dimensionality reduction. That means that if we have a 1x1 convolution, the kernel extent in the dimension calculation is ONE:
output_dim = (input_dim + 2 * pad_data[i] - kernel_extent) / stride_data[i] + 1;
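Plugging numbers into that formula shows that a 1x1 convolution with stride 1 and no padding leaves the spatial size unchanged (it only reduces the channel depth), whereas an unpadded 3x3 convolution shrinks it:

def output_dim(input_dim, pad, kernel_extent, stride):
    # Same formula as above (Caffe-style output size calculation).
    return (input_dim + 2 * pad - kernel_extent) // stride + 1

print(output_dim(28, pad=0, kernel_extent=1, stride=1))   # 28: 1x1 conv keeps 28x28
print(output_dim(28, pad=0, kernel_extent=3, stride=1))   # 26: unpadded 3x3 conv shrinks it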
Inception, in other words GoogLeNet, is a huge network (more than 100 layers) and it is computationally infeasible for many CPUs or even GPUs to go through all the convolutions, so it needs to reduce dimensionality.
You could use a deeper AlexNet (with more layers) on the ImageNet dataset and I bet it would give you a good result, but when you want to go deeper than 30 layers you need a good strategy, like Inception. By the way, the ImageNet dataset has over 5 million images (last time I checked), and in deep nets more images == more accuracy.

maximum stage and sprite size?

I'm making an action game and I'd like to know what the maximum size of the stage should be (mine is 660 x 500).
Also I'd like to know how big a game-sprite should be. Currently my biggest sprites have a size of 128 x 128 and I read somewhere on the internet that you should not make it bigger because of performance issues.
If you want to make, e.g., big explosions with shockwaves, even 128 x 128 does not look very big. What's the maximum size I can safely use for sprites? I cannot find any real answer to this, so I appreciate every hint I can get, because this topic makes me a little nervous.
Cited from:
http://help.adobe.com/en_US/FlashPlatform/reference/actionscript/3/flash/display/DisplayObject.html
http://kb2.adobe.com/cps/496/cpsid_49662.html
Display objects:
Flash Player 10 increased the maximum size of a bitmap to a maximum
pixel count of 16,777,215 (the decimal equivalent of 0xFFFFFF). There
is also a single-side limit of 8,191 pixels.
The largest square bitmap allowed is 4,095 x 4,095 pixels.
Content compiled to a SWF 9 target and running in Flash Player 10 or
later is still subject to Flash Player 9 limits (2,880 x 2,880 pixels).
In Flash Player 9 and earlier, the limitation is 2,880 pixels in
height and 2,880 pixels in width.
Stage
The usable stage size limit in Flash Player 10 is roughly 4,050 pixels
by 4,050 pixels. However, the usable size of the stage varies
depending on the settings of the QUALITY tag. In some cases, it's
possible to see graphic artifacts when stage size approaches the 3840
pixel range.
If you're looking for hard numbers, Jason's answer is probably the best you're going to get. Unfortunately, I think the only way to get a real answer to your question is to build your game and do some performance testing. The file size and dimensions of your sprite maps are going to affect RAM/CPU usage, but how much is too much depends on how many sprites are on the stage, how they interact, and what platform you're deploying to.
A smaller stage will sometimes get you better performance (you'll tend to display fewer things), but what is more important is what you do with it. Also, a game with a stage larger than 800x600 may turn off potential sponsors (if you go that route with your game) because it won't fit on their portal site.
Most of my sprite sheets use tiles less than 64x64 pixels, but I have successfully implemented a sprite with each tile as large as 491x510 pixels. It doesn't have a super-complex animation, but the game runs at 60fps.
Bitmap caching is not necessarily the answer, but I found these resources to be highly informative when considering the impact of my graphics on performance.
http://help.adobe.com/en_US/as3/mobile/WS4bebcd66a74275c36c11f3d612431904db9-7ffc.html
and a video demo:
http://tv.adobe.com/watch/adobe-evangelists-paul-trani/optimizing-graphics/
Also, as a general rule, build your game so that it works first, then worry about optimization. A profiler can help you spot memory leaks and CPU spikes. FlashDevelop has one built in, or there's often a console in packages like FlashPunk, or the good old fashioned Windows Task Manager can be enough.
That might not be a concrete answer, but I hope it helps.

downscale large 1 bit tiff to 8 bit grayscale / 24 bit

Let's say I have a 100000x100000 1-bit (K channel) TIFF with a DPI of 2000 and I want to downscale this to a DPI of 200. My resulting image would be a 10000x10000 image. Does this mean that every 10 bits in the 1-bit image correspond to 1 pixel in the new image? By the way, I am using libtiff and reading the 1-bit TIFF with TIFFReadScanline. Thanks!
That means every 100 bits in the 1-bit image correspond to 1 pixel in the new image: you'd need to average the values over a 10x10 area of 1-bit pixels. For smoother grayscales, you'd do better to average over n bits, where n matches the bit depth of your target pixel, partially overlapping each sampled area with its neighbours (16x16 px squares spaced 10 px apart, so their borders overlap, for a smooth 8-bit grayscale).
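A hedged NumPy sketch of the non-overlapping version (average each 10x10 block of 1-bit pixels into one 8-bit grayscale pixel; producing the bilevel array from TIFFReadScanline is left to your libtiff code):

import numpy as np

def downscale_1bit_to_gray(bilevel, factor=10):
    # Average `factor` x `factor` blocks of a 1-bit image (values 0/1) into
    # 8-bit grayscale. Assumes both dimensions are multiples of `factor`.
    h, w = bilevel.shape
    blocks = bilevel.reshape(h // factor, factor, w // factor, factor)
    means = blocks.mean(axis=(1, 3))              # fraction of set bits per block
    return np.round(means * 255).astype(np.uint8)

# Toy example: a 100x100 1-bit image becomes a 10x10 grayscale image.
bilevel = (np.random.rand(100, 100) > 0.5).astype(np.uint8)
print(downscale_1bit_to_gray(bilevel).shape)   # (10, 10)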
It is important to understand why you want to downscale (because of the output medium, or because of file size?). As SF pointed out, colors/grayscale are somewhat interchangeable with resolution. If it is only about file size, lossless/lossy compression is also worth looking at.
The other thing is to understand a bit about the characteristics of your source image. For instance, if the source image is rasterized (as for newspaper images) you may get awkward patterns because the dot matrix gets messed up. I once tried to restore an old newspaper image and found it a lot of work. I ended up converting it to grayscale first before enhancing the image.
I suggest experimenting a bit with VIPS or IrfanView to find the best results (i.e. to see what effect a certain resampling algorithm has on your image quality). The reason for these programs (over, e.g., Photoshop) is that you can experiment from the GUI/command line while being aware of the names/parameters of the algorithms behind them. With VIPS you can control most if not all parameters.
[Edit]
TiffDump (supplied with the LibTiff binaries) is a valuable source of information. It will tell you about byte ordering, etc. What I did was to start with a known image; for instance, LibTIFF.NET comes with many test images, including b&w (some with 0=black, some with 1=black). [/Edit]