How is 3D texture memory cached?

I have an application where 96% of the time is spent in 3D texture memory interpolation reads (red points in diagram).
My kernels are designed to do ~1000 memory reads along a line that crosses the texture memory arbitrarily, one thread per line (blue lines). These lines are densely packed, very close to each other, and travel in almost parallel directions.
The image shows the concept of what I am talking about. Imagine the image is a single "slice" from the 3D texture memory, e.g. z=24. The image is repeated for all z.
At the moment, I am executing threads just one line after the other, but I realized that I might be able to benefit from texture memory locality if I call adjacent lines in the same block, reducing the time for memory reads.
My questions are:
If I have a 3D texture with linear interpolation, how can I benefit most from the data locality? By running adjacent lines in the same block in 2D, or adjacent lines in 3D (3D neighbors, or just neighbors per slice)?
How "big" is the cache (or how can I check this in the specs)? Does it load e.g. the requested voxel and ±50 around it in every direction? This will directly relate to the number of neighboring lines I'd put in each block!
How does the interpolation apply to the texture memory cache? Is the interpolation also performed on the cached data, or does the fact that it is interpolated reduce the memory latency because it needs to be done in the texture memory itself?
I am working on an NVIDIA Tesla K40 with CUDA 7.5, if it helps.
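For reference, a minimal sketch of the kind of kernel I mean, one thread per line (the texture object API and the ray start/step arrays are placeholders, not my actual geometry code):

// Each thread marches along its own line through the 3D texture,
// accumulating hardware-interpolated samples.
__global__ void sampleAlongLines(cudaTextureObject_t tex,
                                 const float3 *start, const float3 *step,
                                 float *sums, int nLines, int nSamples)
{
    int line = blockIdx.x * blockDim.x + threadIdx.x;
    if (line >= nLines) return;

    float3 p = start[line];
    float3 d = step[line];
    float acc = 0.0f;
    for (int i = 0; i < nSamples; ++i) {
        // +0.5f centres the read on texel centres for unnormalized coordinates
        acc += tex3D<float>(tex, p.x + 0.5f, p.y + 0.5f, p.z + 0.5f);
        p.x += d.x; p.y += d.y; p.z += d.z;
    }
    sums[line] = acc;
}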

As this question is getting old, and no answers seem to exist for some of the questions I asked, I will give a benchmark answer, based on my research building the TIGRE toolbox. You can get the source code in the Github repo.
As the answer is based on the specific application of the toolbox, computed tomography, my results are not necessarily true for all applications using texture memory. Additionally, my GPU (see above) is quite a decent one, so your mileage may vary on different hardware.
The specifics
It is important to note: this is a Cone Beam Computed Tomography application. This means that:
The lines are more or less uniformly distributed across the image, covering most of it.
The lines are more or less parallel to adjacent lines and will predominantly lie in a single plane, e.g. they are always more or less horizontal, never vertical.
The sample rate along the lines is the same, meaning that adjacent lines will always sample their next points very close to each other.
All this information is important for memory locality.
Additionally, as said in the question, 96% of the kernel time is memory reading, so it's safe to assume that the variation in the kernel times reported is due to changes in the speed of memory reading.
The questions
If I have a 3D texture with linear interpolation, how can I benefit most from the data locality? By running adjacent lines in the same block in 2D, or adjacent lines in 3D (3D neighbors, or just neighbors per slice)?
Once one gets a bit more experienced with texture memory, one sees that the straightforward answer is: run as many adjacent lines together as possible. The closer the memory reads are to each other in image index, the better.
For tomography, this effectively means running square blocks of detector pixels, packing rays (blue lines in the original image) together.
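A sketch of what this looks like in practice (the kernel name, detector sizes and body are placeholders, not the TIGRE code): each 2D thread block covers a square patch of detector pixels, so neighbouring threads trace neighbouring, almost parallel rays.

// Each thread handles one detector pixel (u,v); adjacent threads in a block
// therefore trace adjacent rays, which is what exploits the texture cache.
__global__ void projectionKernel(cudaTextureObject_t tex, float *out,
                                 int detU, int detV)
{
    int u = blockIdx.x * blockDim.x + threadIdx.x;
    int v = blockIdx.y * blockDim.y + threadIdx.y;
    if (u >= detU || v >= detV) return;
    // ... march along the ray of pixel (u,v), accumulating tex3D() reads ...
    out[v * detU + u] = 0.0f;   // placeholder for the accumulated value
}

void launchProjection(cudaTextureObject_t tex, float *d_out, int detU, int detV)
{
    dim3 block(8, 8);   // 8x8 adjacent rays per block (benchmarked below)
    dim3 grid((detU + block.x - 1) / block.x,
              (detV + block.y - 1) / block.y);
    projectionKernel<<<grid, block>>>(tex, d_out, detU, detV);
}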
How "big" is the cache (or how can I check this in the specs)? Does it load e.g. the asked voxel and +-50 around it in every direction? This will directly relate with the amount of neighboring lines I'd put in each block!
While it is impossible to say exactly, empirically I found that running smaller blocks is better. My results, for a 512^3 image with 512^2 rays and a sample rate of ~2 samples/voxel, by block size:
32x32 -> [18~25] ms
16x16 -> [14~18] ms
8x8 -> [11~14] ms
4x4 -> [25~29] ms
The block sizes are effectively the sizes of the squares of adjacent rays that are computed together. E.g. 32x32 means that 1024 X-rays are computed in parallel, adjacent to each other in a 32x32 square. As the exact same operations are performed along each line, this means that at any given step the samples are taken from roughly a 32x32 patch of the image, covering approximately 32x32x1 indices.
Predictably, at some point when reducing the block size the speed gets slow again, but this happens at a (at least for me) surprisingly low value. I think this hints that the texture cache loads relatively small chunks of data from the image.
These results show something additional that was not asked in the original question: how out-of-bounds samples behave regarding speed. As adding any if condition to the kernel would significantly slow it down, I programmed the kernel to start sampling at a point on the line that is guaranteed to be outside the image, and to stop in a similar way. This is done by creating a fictional "sphere" around the image and always taking the same number of samples, independent of the angle between the image and the lines themselves.
If you look at the times for each kernel shown above, you'll notice they all span [t ~sqrt(2)*t], and I have checked that the longer times indeed occur when the angle between the lines and the image is a multiple of 45 degrees, where more samples fall inside the image (texture).
This means that sampling outside the image (tex3D(tex, -5,-5,-5)) is computationally free: no time is spent reading out of bounds. It is better to read a lot of out-of-bounds points than to check whether the points fall inside the image, as the if condition will slow the kernel down while sampling out of bounds has zero cost.
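Note that the "free" out-of-bounds behaviour depends on how the texture is set up. A sketch of the configuration I am assuming (texture object API), where cudaAddressModeBorder makes out-of-range coordinates return 0 so they simply add nothing:

// Sketch of the texture setup assumed above. cudaAddressModeBorder returns 0
// for out-of-range coordinates, so sampling outside the volume contributes
// nothing and needs no boundary checks in the kernel.
cudaTextureObject_t makeVolumeTexture(cudaArray_t volumeArray)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = volumeArray;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeBorder;
    texDesc.addressMode[1] = cudaAddressModeBorder;
    texDesc.addressMode[2] = cudaAddressModeBorder;
    texDesc.filterMode       = cudaFilterModeLinear;   // hardware trilinear interpolation
    texDesc.readMode         = cudaReadModeElementType;
    texDesc.normalizedCoords = 0;                       // coordinates in voxel units

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    return tex;
}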
How does the interpolation apply to the texture memory cache? Is the interpolation also performed on the cached data, or does the fact that it is interpolated reduce the memory latency because it needs to be done in the texture memory itself?
To test this, I ran the same code with linear interpolation (cudaFilterModeLinear) and with nearest-neighbor interpolation (cudaFilterModePoint). As expected, there is a speed improvement when nearest-neighbor interpolation is used. For 8x8 blocks with the previously mentioned image sizes, on my PC:
Linear -> [11~14] ms
Nearest -> [ 9~10] ms
The speedup is not massive, but it is significant. This hints, as expected, that the time the cache takes to interpolate the data is measurable, so one needs to be aware of it when designing applications.
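For reference, this kind of per-kernel timing can be measured with CUDA events; a minimal sketch (reusing the hypothetical launchProjection from the earlier sketch):

// Brackets the kernel launch with CUDA events and returns elapsed GPU time in ms.
float timeKernelMs(cudaTextureObject_t tex, float *d_out, int detU, int detV)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    launchProjection(tex, d_out, detU, detV);   // hypothetical launcher from above
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}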

Finding images in RAM dump

Some classical security / hacking challenges include having to analyze the dump of the physical RAM of a system. volatility does a great job at extracting useful information, including a wireframe view of the windows displayed at the time (using the command screenshot). But I would like to go further and find the actual content of the windows.
So I'd like to reformulate the problem as finding raw images (think matrix of pixels) in a large file. If I can do this, I hope to find the content of the windows, at least partially.
My idea was to rely on the fact that a row of pixels is similar to the next one. If I find a large enough number of lines of the same size, then I let the user fiddle around with an interactive tool and see if it decodes to something interesting.
For this, I would compute a kind of spectrogram. More precisely, a heatmap where the shade shows how likely it is for the block of data #x to be part of an image of width y bytes, with x and y the axes of the spectrogram. Then I'd just have to look for horizontal lines in it. (See the examples below.)
The problem I have right now is to find a method to compute that "kind of spectrogram" accurately and quickly. As an order of magnitude, I would like to be able to find images of width 2048 in RGBA (8192 bytes per row) in a 4GB file in a few minutes. That means processing a few tens of MB per second.
I tried using FFT and autocorrelation, but they do not show the kind of accuracy I'm after.
The problem with FFT
Since finding the length of a mostly repeating pattern looks like finding a frequency, I tried to use a Fourier transform with 1 byte = 1 sample and plot the absolute value of the spectrum.
But the main problem is the period resolution. Since I'm interested in finding the period of the signal (the byte length of a row of pixels), I want to plot the spectrogram with period length on the y axis, not frequency. But the discrete Fourier transform computes frequencies that are multiples of 1/n (for n data points), which gives me a very low resolution for large periods and a higher-than-needed resolution for short periods.
Here is a spectrogram computed with this method on a 144x90 RGB BMP file. We expect a peak at an offset 432. The window size for the FFT was 4320 bytes.
And the segment plot of the first block of data.
I calculated that if I need to distinguish between periods k and k+1, then I need a window size of roughly k². So for 8192 bytes, that makes the FFT window about 16MB. Which would be way too slow.
So the FFT computes too much information I don't need and not enough information I would need. But given a reasonable window size, it usually shows a sharp peak at about the right period.
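For reference, here is where the k² estimate comes from (assuming nothing beyond the standard DFT bin spacing). An n-point DFT has bins at frequencies m/n cycles per sample, i.e. at periods p_m = n/m, so the spacing between neighbouring representable periods is

p_m - p_{m+1} = n/m - n/(m+1) = n/(m(m+1)) ≈ p²/n

Requiring this spacing to be no larger than the smallest period difference Δ that must be resolved gives n ≳ k²/Δ: k² for Δ = 1 byte, and on the order of 16 MB for k = 8192 bytes if rows can only differ by whole 4-byte RGBA pixels (Δ = 4).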
The problem with autocorrelation
The other method I tried is to use a kind of discrete autocorrelation to plot the spectrogram.
More exactly, what I compute is the cross-correlation between a block of data and one half of it, and only for the offsets where the small block lies fully inside the large block. The size of the large block has to be at least twice the maximum period to plot.
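A minimal sketch of what I mean (plain C++; the names are mine, and the raw product correlation shown here is what turns out to vary too slowly, as discussed below):

// For one block of the dump, score every candidate row width (period) by
// correlating the first half of the block with the block shifted by "period".
// "block" must be at least 2*maxPeriod bytes long.
#include <cstdint>
#include <vector>

std::vector<double> periodScores(const uint8_t *block, size_t blockLen, size_t maxPeriod)
{
    size_t half = blockLen / 2;
    std::vector<double> scores(maxPeriod + 1, 0.0);
    for (size_t period = 1; period <= maxPeriod && half + period <= blockLen; ++period) {
        double sum = 0.0;
        for (size_t i = 0; i < half; ++i)
            sum += (double)block[i] * (double)block[i + period];
        scores[period] = sum / half;    // one column of the "spectrogram"
    }
    return scores;
}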
Here is an example of spectrogram computed with this method on the same image as before.
And the segment plot of the autocorrelation of the first block of data.
Although it produces just the right amount of data, the value of the autocorrelation changes slowly, so it does not make a sharp peak at the right period.
Question
Is there a way to get both a sharp peak around the correct period and enough precision for the large periods? Either by tweaking the aforementioned algorithms or by using a completely different one.
I can't judge much about the FFT part. From the title ("Finding images in RAM dump") it seems you are trying to solve a bigger problem and FFT is only a part of it, so let me answer the parts where I have some knowledge.
analyze the RAM dump of a system
This sounds much like physical RAM. If an application takes a screenshot, that screenshot is in virtual RAM. This results in two problems:
a) the screenshot may be incomplete, because parts of it are paged out to disk
b) you need to perform a physical address to virtual address mapping in order to bring the bytes of the screenshot into correct order
find raw images from the memory dump
To me, the definition of what a raw image is is unclear. Any application storing an image will use an image file format. Storing only the data makes the screenshot useless.
In order to perform an FFT on the data, you should probably know whether it uses 24 bit per pixel or 32 bit per pixel.
I hope to find a screenshot or the content of the current windows
This would require an application that takes screenshots. You can of course hope. I can't judge about the likeliness of that.
rely on the fact that a row of pixels is similar to the next one
You probably hope to find some black on white text. For that, the assumption may be ok. If the user is viewing his holiday pictures, this may be different.
Also note that many values in a PC are 32 bit (Integer, Float) and 0x00000000 is a very common value. Your algorithm may detect this.
images of width 2048
Is this just a guess? Or would you finally brute-force all common screen sizes?
in RGBA
Why RGBA? A screenshot typically does not have transparency.
With all of the above, I wonder whether it wouldn't be more efficient to search for image signatures like JPEG, BMP or PNG headers in the dump and then analyze those headers and simply get the picture from the metadata.
Note that this has been done before; e.g. WinDbg has some commands in the ext debugger extension, which is loaded by default:
!findgifs
!findjpegs
!findjpgs
!findpngs
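As a rough illustration of the signature-search idea (a sketch, not a substitute for those commands): scan the dump for a few well-known magic numbers, then parse the headers at the reported offsets for dimensions and pixel format. The "BM" signature in particular will produce many false positives that the header parsing has to weed out.

// Scan a dump buffer for common image-file signatures and print candidate offsets.
#include <cstdint>
#include <cstdio>
#include <cstring>

void findImageSignatures(const uint8_t *dump, size_t dumpLen)
{
    struct Signature { const char *name; const uint8_t *magic; size_t len; };
    static const uint8_t png[]  = {0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    static const uint8_t jpeg[] = {0xFF, 0xD8, 0xFF};
    static const uint8_t gif[]  = {'G', 'I', 'F', '8'};
    static const uint8_t bmp[]  = {'B', 'M'};
    const Signature sigs[] = {
        {"PNG", png, sizeof png}, {"JPEG", jpeg, sizeof jpeg},
        {"GIF", gif, sizeof gif}, {"BMP", bmp, sizeof bmp},
    };
    for (size_t off = 0; off + 8 <= dumpLen; ++off)
        for (const Signature &s : sigs)
            if (std::memcmp(dump + off, s.magic, s.len) == 0)
                std::printf("%-4s candidate at offset 0x%zx\n", s.name, off);
}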

Moving memory around on device in CUDA

What is the fastest way to move data that is on the device around in CUDA?
What I need to do is basically copy contiguous sub-rows and sub-columns (of which I have the indexes on the device) from row-major matrices into new smaller matrices, but from what I've observed, memory access in CUDA is not particularly efficient, as it seems the cores are optimized to do computation rather than memory stuff.
Now the CPU seems to be pretty good at doing sequential stuff like moving rows of aligned memory from a place to another.
I see three options:
make a kernel that does the memory copying
outside a kernel, call cudaMemcpy(.., device to device) for each position (terribly slow for columns I would guess)
move the memory to the host, create the new smaller matrix and send it back on the device
Now I could test this on my specific gpu, but given its specs I don't think it would be representative. In general, what is recommended?
Edit:
I'm essentially multiplying two matrices A,B but I'm only interested in multiplying the X elements:
A =[[XX XX]
[ XX XX ]
[XX XX ]]
with the corresponding elements in the columns of B. The XX are always of the same length and I know their positions (and there's a fixed number of them per row).
If you have a matrix storage pattern that involves varying spacing between corresponding row elements (or corresponding column elements), none of the input transformation or striding capabilities of cublas will help, and none of the api striding-copy functions (such as cudaMemcpy2D) will help.
You'll need to write your own kernel to gather the data before feeding it to cublasXgemm. This should be fairly trivial to do if you have the locations of the incoming data elements listed in a vector or enumerable in some other way.
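A minimal sketch of such a gather kernel (the names and the flat index list are assumptions: srcIdx holds, for each element of the packed output, its index into the row-major source matrix):

// Gather arbitrarily located elements of a row-major matrix into a dense
// buffer, given a precomputed index list that already lives on the device.
__global__ void gatherKernel(const float *src, const int *srcIdx,
                             float *dst, int nElements)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nElements)
        dst[i] = src[srcIdx[i]];   // coalesced write, gathered read
}

// Launch with one thread per gathered element, then hand "dst" to cublasSgemm:
// gatherKernel<<<(nElements + 255) / 256, 256>>>(d_src, d_srcIdx, d_dst, nElements);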

Why does the order of dimensions make a big difference in performance?

To launch a CUDA kernel, we use dim3 to specify the dimensions, and I think the meaning of each dimension is up to the user; for example, it could mean (width, height) or (rows, cols), which have the meaning reversed.
So I did an experiment with the CUDA sample in the SDK: 3_Imaging/convolutionSeparable, simply exchanging .x and .y in the kernel function and reversing the dimensions of blocks and threads used to launch the kernel, so the meaning changes from dim(width, height)/idx(x, y) to dim(rows, cols)/idx(row, col).
The result is the same; however, the performance decreases: the original one takes about 26 ms, while the modified one takes about 40 ms on my machine (SM 3.0).
My question is: what makes the difference? Is (rows, cols) not feasible for CUDA?
P.S. I only modified convolutionRows, not convolutionColumns.
EDIT: The change can be found here.
There are at least two potential consequences of your changes:
First, you are changing the memory access pattern to main memory, so the accesses are not as coalesced as in the original case.
You should think about GPU main memory in the same way as "CPU" memory, i.e. prefetching, blocking, sequential accesses... are techniques to apply in order to get performance. If you want to know more about this topic, the paper "What every programmer should know about memory" is mandatory reading. You'll find an example there comparing row-major and column-major access to the elements of a matrix.
To get an idea of how important this is, consider that most, if not all, high-performance GPU codes perform a matrix transposition before any computation in order to achieve more coalesced memory access, and this additional step is still worth it in terms of performance (sparse matrix operations, for instance).
Second, this is more subtle, but in some scenarios it has a deep impact on the performance of a kernel: the launch configuration. It is not the same to launch 20 blocks of 10 threads as to launch 10 blocks of 20 threads. There is a big difference in the amount of resources a thread needs (shared memory, number of registers, ...). The more resources a thread needs, the fewer warps can be mapped onto a single SM, so the lower the occupancy... and, most of the time, the lower the performance.
This does not apply to your question, since the number of blocks and threads is the same in both cases.
When programming for GPUs you must be aware of the architecture in order to understand how such changes will modify the performance. Of course, I am not familiar with the code, so there may be other factors besides these two.
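To make the coalescing point concrete, a toy sketch (not the convolutionSeparable code): in the first kernel consecutive threads of a warp read consecutive addresses of a row-major image (coalesced); in the second they read addresses one full row apart (strided), which is essentially what swapping .x and .y does.

// Toy illustration only, for a row-major image of size width x height.
__global__ void coalescedCopy(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // fastest-varying thread index
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = in[y * width + x];      // a warp touches a contiguous row segment
}

__global__ void stridedCopy(const float *in, float *out, int width, int height)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x; // fastest-varying index is now the row
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        out[row * width + col] = in[row * width + col]; // a warp touches elements one row apart
}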

Kernel design for overlapping data, launch of a separate warp

I have a question regarding a CFD application I am trying to implement according to a paper I found online. This might be somewhat of a beginner question, but here it goes:
The situation is as follows:
The 2D domain gets decomposed into tiles. Each of these tiles is processed by a block of the kernel in question. The calculations being executed are highly suited to parallel execution, as they take into account only a handful of neighbours (it's a shallow water application). The tiles do overlap: each tile has 2 extra cells on each side of the domain it is supposed to calculate the result for.
On the left you see 1 block, on the right 4, with the overlap that comes with it. Grey cells are "ghost cells" needed for the calculation; light green is the domain each block actually writes back to global memory. Needless to say, the whole domain is going to have more than 4 tiles.
The idea per thread goes as follows:
(1) copy data from global memory to shared memory
__syncthreads();
(2) perform some calculations
__syncthreads();
(3) perform some more calculations
(4) write back to global memory
For the cells in the green area, the kernel is straightforward: you copy data according to your thread ID and calculate along, using your neighbours in shared memory. Because of the nature of the data dependency, however, this does not suffice:
(1) has to be run on all cells (grey and green). No dependency.
(2) has to be run on all green cells, and the inner rows/columns of the grey cells. Depends on neighbouring data N,S,E and W.
(3) has to be run on all green cells. Depends on data from step (2) on neighbours N,S,E and W.
So here goes my question:
How does one do this without terribly cluttered code?
All I can think of is a horrible number of "if" statements to decide whether a thread should perform some of these steps twice, depending on its thread ID.
I have considered using overlapping blocks as well (as opposed to just overlapping data), but this leads to another problem: the __syncthreads() calls would have to be in conditional parts of the code.
Taking the kernel apart and having the steps (2)/(3) run in different kernels is not really an option either, as they produce intermediate results which can't all be written back to memory because of their number/size.
The author himself writes this (Brodtkorb et al. 2010, Efficient Shallow Water Simulations on GPUs: Implementation, Visualization, Verification, and Validation):
When launching our kernel, we start by reading from global memory into on-chip shared memory. In addition to the interior cells of our block, we need to use data from two neighbouring cells in each direction to fulfill the data dependencies of the stencil. After having read data into shared memory, we proceed by computing the one dimensional fluxes in the x and y directions, respectively. Using the steps illustrated in Figure 1, fluxes are computed by storing all values that are used by more than one thread in shared memory. We also perform calculations collectively within one block to avoid duplicate computations. However, because we compute the net contribution for each cell, we have to perform more reconstructions and flux calculations than the number of threads, complicating our kernel. This is solved in our code by designating a single warp that performs the additional computations; a strategy that yielded a better performance than dividing the additional computations between several warps.
So, what does he mean by designating a single warp to do these computations, and how does one do so?
So, what does he mean by designating a single warp to do these computations, and how does one do so?
You could do something like this:
// work that is done by all threads in a block
__syncthreads(); // may or may not be needed
if (threadIdx.x < 32) {
    // work that is done only by the designated single warp
}
Although that's trivially simple, considering the last question in your post and the highlighted paragraph, I think it's very likely what they are referring to; I think it fits with what I'm reading here. Furthermore, I don't know of any other way to restrict work to a single warp except by using conditionals. They may also have chosen a single warp to take advantage of warp-synchronous behavior, which gets around the __syncthreads()-in-conditional-code issue you mention earlier.
So here goes my question: how does one do this without terribly cluttered code?
All I can think of is a horrible number of "if" statements to decide whether a thread should perform some of these steps twice, depending on its thread ID.
Actually, I don't think any sequence of ordinary "if" statements, regardless of how cluttered, could solve the problem you describe.
A typical way to solve the dependency between steps 2 and 3 that you have already mentioned is to separate the work into two ( or more) kernels. You indicate that this is "not really an option", but as near as I can tell, what you're looking for is a global sync. Such a concept is not well-defined in CUDA except for the kernel launch/exit points. CUDA does not guarantee execution order among blocks in a grid. If your block calculations in step 3 depend on neighboring block calculations in step 2, then in my opinion, you definitely need a global sync, and your code is going to get ugly if you don't implement it with a kernel launch. Alternative methods such as using global semaphores or global block counters are, in my opinion, fragile and difficult to apply to general cases of widespread data dependence (where every block is dependent on neighbor calculations from the previous step).
If the neighboring calculations depend only on the data from a thin set of neighboring cells ("halo"), and not the whole neighboring block, and those cells can be computed independently, then it might be possible to expand your block to include neighboring cells (i.e. overlap), effectively computing the halo regions twice between neighboring blocks, but you've indicated you've already considered and discarded this idea. However, I personally would want to consider the code in detail before accepting the idea that this must be rejected based entirely on difficulty with __syncthreads(). In my experience, people who say they can't use __syncthreads() due to conditional code execution haven't accurately considered all the options, at a detailed code level, to make __syncthreads() work, even in the midst of conditional code.
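As an illustration of that last point (a sketch under assumed tile/halo sizes, not the paper's kernel): size the block to cover the halo as well, let every thread participate in the loads and in the barriers, and guard only the actual work with conditions, so __syncthreads() never sits inside a divergent branch.

// Assumes a launch with blockDim = (TILE + 2*HALO, TILE + 2*HALO).
#define TILE 16
#define HALO 2

__global__ void stencilStep(const float *in, float *out, int nx, int ny)
{
    __shared__ float s[TILE + 2 * HALO][TILE + 2 * HALO];

    int gx = blockIdx.x * TILE + threadIdx.x - HALO;   // global cell of this thread
    int gy = blockIdx.y * TILE + threadIdx.y - HALO;

    // (1) every thread loads one cell (clamped at the domain edge), ghost cells included
    int cx = min(max(gx, 0), nx - 1);
    int cy = min(max(gy, 0), ny - 1);
    s[threadIdx.y][threadIdx.x] = in[cy * nx + cx];
    __syncthreads();                                    // outside any conditional

    // (2) intermediate work on the green cells plus the inner ring of grey cells
    float tmp = 0.0f;
    bool doStep2 = threadIdx.x >= 1 && threadIdx.x < TILE + 2 * HALO - 1 &&
                   threadIdx.y >= 1 && threadIdx.y < TILE + 2 * HALO - 1;
    if (doStep2)
        tmp = 0.25f * (s[threadIdx.y][threadIdx.x - 1] + s[threadIdx.y][threadIdx.x + 1] +
                       s[threadIdx.y - 1][threadIdx.x] + s[threadIdx.y + 1][threadIdx.x]);
    __syncthreads();                                    // still outside any conditional
    s[threadIdx.y][threadIdx.x] = tmp;                  // placeholder intermediate result
    __syncthreads();

    // (3)+(4) only interior (green) threads compute the final value and write it out
    bool interior = threadIdx.x >= HALO && threadIdx.x < TILE + HALO &&
                    threadIdx.y >= HALO && threadIdx.y < TILE + HALO &&
                    gx < nx && gy < ny;
    if (interior)
        out[gy * nx + gx] = s[threadIdx.y][threadIdx.x]; // placeholder for the real update
}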

How to find all frequencies in audio with discrete fourier transform?

I want to analyze some audio and decompose it as best as I can into sine waves. I have never used FFT before and am just doing some initial reading about the concepts and available libraries, like FFTW and KissFFT.
I'm confused on this point... it sounds like the DFT/FFT will give you the sine amplitudes only at certain frequencies, multiples of a base frequency. For example, if I have audio sampled at the usual 44100 Hz, and I pick a chunk of say 256 samples, then that chunk could fit one cycle of 44100/256 ≈ 172 Hz, and the DFT will give me the sine amplitudes at 172, 172*2, 172*3, etc. Is that correct? How do you then find the strength at other frequencies? I'd like to see a spectrum all the way from 20 Hz to about 15 kHz, at about 1 Hz increments.
Fourier decomposition allows you to take any function of time and describe it as a sum of sine waves, each with a different amplitude and frequency. If, however, you want to approach this problem using the DFT, you need to make sure you have sufficient resolution in the frequency domain in order to distinguish between different frequencies. Once you have that, you can determine which frequencies are dominant in the signal and create a signal consisting of multiple sine waves corresponding to those frequencies. You are correct in saying that with a sampling frequency of 44.1 kHz, only looking at 256 samples, the lowest frequency you will be able to detect in those 256 samples is 172 Hz.
OBTAIN SUFFICIENT RESOLUTION IN THE FREQUENCY DOMAIN:
Amplitude values "only at certain frequencies, multiples of a base frequency" is true for Fourier series decomposition, NOT for the DFT, which has a frequency resolution of a certain increment. The frequency resolution of the DFT is related to the sampling rate and the number of samples of the time-domain signal used to calculate the DFT. Reducing the frequency spacing will give you a better ability to distinguish between two frequencies close together, and this can be done in two ways:
Decreasing the sampling rate, but this would move the periodic repetitions in frequency closer together. (Remember the Nyquist theorem here.)
Increasing the number of samples used to calculate the DFT. If only the 256 samples are available, one can perform "zero padding", where 0-valued samples are appended to the end of the data, but there are some effects of this which need to be considered (a sketch of this option follows below).
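A small sketch of the zero-padding option, assuming FFTW (one of the libraries mentioned in the question) and a padded length of 4096; the function and variable names are mine. Note that zero padding interpolates the spectrum onto a finer grid (44100/4096 ≈ 10.8 Hz per bin here) but does not add true resolving power.

// Zero-pad nSamples real samples to nPadded points and return the magnitude
// spectrum. Bin k corresponds to a frequency of k * sampleRate / nPadded Hz.
#include <fftw3.h>
#include <cmath>
#include <cstring>
#include <vector>

std::vector<double> paddedSpectrum(const double *samples, int nSamples, int nPadded)
{
    double *in = fftw_alloc_real(nPadded);
    fftw_complex *out = fftw_alloc_complex(nPadded / 2 + 1);

    std::memset(in, 0, nPadded * sizeof(double));          // the zero padding
    std::memcpy(in, samples, nSamples * sizeof(double));   // the real samples (e.g. 256)

    fftw_plan plan = fftw_plan_dft_r2c_1d(nPadded, in, out, FFTW_ESTIMATE);
    fftw_execute(plan);

    std::vector<double> mag(nPadded / 2 + 1);
    for (int k = 0; k <= nPadded / 2; ++k)
        mag[k] = std::hypot(out[k][0], out[k][1]);          // |X[k]|

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return mag;
}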
HOW TO COME TO A CONCLUSION:
If you depict the frequency content of different audio signals in individual graphs, you will find that the amplitudes differ a bit. This is because the individual signals will not be identical in sound, and there is always noise inherent in any signal (from the surroundings and the hardware itself). Therefore, what you want to do is to take the average of two or more DFT signals to remove noise and get a more accurate representation of the frequency content. Depending on your application, this may not be possible if the sound you are capturing is noticeably changing rapidly over time (for example speech, or music). Averaging is thus only useful if all the signals to be averaged are pretty much equal in sound (individual separate recordings of "the same thing"). Just to clarify: from, for example, four time-domain signals, you want to create four frequency-domain signals (using a DFT method), and then calculate the average of the four frequency-domain signals into a single averaged frequency-domain signal. This will remove noise and give you a better representation of which frequencies are inherent in your audio.
AN ALTERNATIVE SOLUTION:
If you know that your signal is supposed to contain a certain number of dominant frequencies (not too many) and these are the only ones you are interested in, then I would recommend that you use Pisarenko's harmonic decomposition (PHD) or multiple signal classification (MUSIC, nice abbreviation!) to find these frequencies (and their corresponding amplitude values). This is less computationally intensive than the DFT. For example, if you KNOW the signal contains 3 dominant frequencies, Pisarenko will return the frequency values for these three, but keep in mind that the DFT reveals much more information, allowing you to come to more conclusions.
Your initial assumption is incorrect. An FFT/DFT will not give you amplitudes only at certain discrete frequencies. Those discrete frequencies are only the centers of bins, each bin constituting a narrow-band filter with a main lobe of non-zero bandwidth, roughly a width or two of the FFT bin separation, depending on the window (rectangular, von Hann, etc.) applied before the FFT. Thus the amplitude of spectral content between bin centers will show up, but spread across multiple FFT result bins.
If the separation of key signals is large enough and the noise level is low enough, then you can interpolate the FFT results to examine frequencies between bin centers. You may need to use a high quality interpolator, such as a Sinc kernel.
If your signal separation is smaller or the noise level is higher, then you may need a longer window of data to feed a longer FFT to gather sufficient resolution information. An FFT window of length 256 at 44.1k sample rate is almost certainly just too short to gather sufficient information regarding spectral content below a few 100 Hz, if those are among the frequencies you would like to see examined, as they can't be separated cleanly from a DC bias (bin 0).
Unfortunately, there's a degree of uncertainty in identifying the frequencies in a fixed sample of a signal. If you use a short FFT, then there's no way to tell the difference between frequencies over a fairly wide range. If you use a long FFT to get higher resolution in the frequency domain, then you can't detect frequency changes as quickly. This is inherent in the math.
Off the top of my head: for 1 Hz increments you need roughly a one-second window, i.e. about a 44100-point FFT at 44.1 kHz, which means you'll get a frequency plot about once per second. The bins then run from 0 Hz up to the 22.05 kHz Nyquist limit, so a 15 kHz range is covered.
You may also be interested in the Short-time Fourier transform. It doesn't solve the fundamental trade-off problem but in practice may get you what you want.