Extracting screenshots from RAM dumps
Some classical security / hacking challenges involve analyzing a dump of the physical RAM of a system. Volatility does a great job at extracting useful information, including a wireframe view of the windows displayed at the time (using the screenshot command). But I would like to go further and find the actual content of the windows.
So I'd like to reformulate the problem as finding raw images (think: a matrix of pixels) in a large file. If I can do this, I hope to recover the content of the windows, at least partially.
My idea is to rely on the fact that a row of pixels is similar to the next one. If I find a large enough number of rows of the same byte length, I can let the user fiddle around with an interactive tool and see whether it decodes to something interesting.
For this, I would compute a kind of spectrogram. More precisely: a heatmap where the shade shows how likely it is for block of data #x to be part of an image whose rows are y bytes long, with x and y the axes of the spectrogram. Then I'd just have to look for horizontal lines in it. (See the examples below.)
The problem I have right now is to find a method to compute that "kind of spectrogram" accurately and quickly. As an order of magnitude, I would like to be able to find images of width 2048 in RGBA (8192 bytes per row) in a 4GB file in a few minutes. That means processing a few tens of MB per second.
I tried using FFT and autocorrelation, but they do not show the kind of accuracy I'm after.
The problem with FFT
Since finding the length of a mostly repeating pattern looks like finding a frequency, I tried to use a Fourier transform with 1 byte = 1 sample and plot the absolute value of the spectrum.
But the main problem is the period resolution. Since I'm interested in finding the period of the signal (the byte length of a row of pixels), I want to plot the spectrogram with the period length on the y axis, not the frequency. But the way the discrete Fourier transform works, it computes the frequencies that are multiples of 1/n (for n data points). This gives me a very low resolution for large periods and a higher-than-needed resolution for short periods.
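To make the method concrete, here is a minimal sketch of one column of that FFT spectrogram (NumPy; the file name is a placeholder, and this is an illustration rather than the exact tool):

```python
import numpy as np

def fft_period_profile(block: bytes):
    """One heatmap column: FFT magnitudes mapped to candidate period lengths.

    Bin i of an n-point FFT corresponds to a period of n/i bytes, so the
    candidate periods are n, n/2, n/3, ... -- very coarse at the long-period
    end, which is exactly the resolution problem described above.
    """
    x = np.frombuffer(block, dtype=np.uint8).astype(np.float64)
    x -= x.mean()                    # remove the DC offset (bin 0)
    mag = np.abs(np.fft.rfft(x))
    n = len(x)
    bins = np.arange(1, len(mag))    # skip bin 0 (infinite period)
    return n / bins, mag[1:]         # (candidate periods in bytes, magnitudes)

# 4320-byte window, as in the BMP example below; "dump.bin" is a placeholder.
with open("dump.bin", "rb") as f:
    periods, mag = fft_period_profile(f.read(4320))
print(periods[np.argmax(mag)])       # ideally close to 432 for that file
```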
Here is a spectrogram computed with this method on a 144x90 RGB BMP file. We expect a peak at a period of 432 bytes (144 pixels x 3 bytes per pixel). The window size for the FFT was 4320 bytes.
And the segment plot of the first block of data.
I calculated that if I need to distinguish between periods k and k+1, then I need a window size of roughly k². So for 8192-byte rows, that makes the FFT window about 64MB (8192² bytes). Which would be way too slow.
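For reference, the reasoning behind that estimate: with an n-point window, a period of p bytes shows up at bin n/p, so periods k and k+1 land at bins n/k and n/(k+1). To tell them apart, those two bins must differ by at least one:

$$\frac{n}{k} - \frac{n}{k+1} = \frac{n}{k(k+1)} \ge 1 \quad\Longrightarrow\quad n \ge k(k+1) \approx k^2$$

For k = 8192 this gives n ≈ 6.7·10⁷ samples, i.e. the ~64MB window mentioned above.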
So the FFT computes too much information I don't need and not enough of the information I do need. But given a reasonable window size, it usually shows a sharp peak at about the right period.
The problem with autocorrelation
The other method I tried is to use a kind of discrete autocorrelation to plot the spectrogram.
More exactly, what I compute is the cross-correlation between a block of data and one half of it, and only for the offsets where the small block lies fully inside the large block. The size of the large block therefore has to be at least twice the maximum period to plot.
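A minimal sketch of one column computed this way (NumPy; placeholder file name, rough normalization):

```python
import numpy as np

def autocorr_profile(block: np.ndarray, max_period: int):
    """One heatmap column: cross-correlation of a block with its half.

    block must hold at least 2 * max_period bytes so the half-block stays
    fully inside the full block for every offset considered ('valid' mode).
    """
    x = block.astype(np.float64)
    x -= x.mean()
    half = x[: len(x) // 2]
    corr = np.correlate(x, half, mode="valid")  # offsets 0 .. len(x)//2
    return corr[:max_period] / corr[0]          # normalize by zero-offset value

# "dump.bin" and the 8192-byte maximum period are placeholders.
data = np.fromfile("dump.bin", dtype=np.uint8, count=2 * 8192)
profile = autocorr_profile(data, max_period=8192)
print(np.argmax(profile[1:]) + 1)               # strongest non-trivial offset
```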
Here is an example of spectrogram computed with this method on the same image as before.
And the segment plot of the autocorrelation of the first block of data.
Although it produces just the right amount of data, the value of the autocorrelation changes slowly, so it does not make a sharp peak at the right period.
Question
Is there a way to get both a sharp peak at the correct period and enough precision for the large periods? Either by tweaking the aforementioned algorithms or by using a completely different one.
I can't judge much about the FFT part. From the title ("Finding images in RAM dump") it seems you are trying to solve a bigger problem, and the FFT is only a part of it, so let me answer those parts where I have some knowledge.
analyze the RAM dump of a system
This sounds much like physical RAM. If an application takes a screenshot, that screenshot is in virtual RAM. This results in two problems:
a) the screenshot may be incomplete, because parts of it are paged out to disk
b) you need to perform a physical address to virtual address mapping in order to bring the bytes of the screenshot into correct order
find raw images from the memory dump
To me, it is unclear what exactly a raw image is. Any application storing an image will use an image file format. Storing only the pixel data would make the screenshot useless.
In order to perform an FFT on the data, you should probably know whether it uses 24 bit per pixel or 32 bit per pixel.
I hope to find a screenshot or the content of the current windows
This would require an application that takes screenshots. You can of course hope. I can't judge the likelihood of that.
rely on the fact that a row of pixels is similar to the next one
You probably hope to find some black on white text. For that, the assumption may be ok. If the user is viewing his holiday pictures, this may be different.
Also note that many values in a PC are 32 bit (integer, float) and 0x00000000 is a very common value. Your algorithm may detect such runs as well.
images of width 2048
Is this just a guess? Or would you finally brute-force all common screen sizes?
in RGBA
Why RGBA? A screenshot typically does not have transparency.
With all of the above, I wonder whether it wouldn't be more efficient to search the dump for image signatures like JPEG, BMP or PNG headers, then analyze those headers and simply get the picture from the metadata (see the sketch after the command list below).
Note that this has been done before; e.g., WinDbg has some commands in the ext debugger extension, which is loaded by default:
!findgifs
!findjpegs
!findjpgs
!findpngs
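The raw signature scan itself is only a few lines; here is a minimal Python sketch of the header-search idea (the file name is a placeholder, and any hit would still need the header parsing and the physical-to-virtual caveats discussed above):

```python
import re

# File signatures ("magic bytes") of common image formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"\xff\xd8\xff": "JPEG",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
    b"BM": "BMP",   # only two bytes: expect many false positives
}
pattern = re.compile(b"|".join(re.escape(s) for s in SIGNATURES))

with open("memdump.raw", "rb") as f:   # placeholder file name
    data = f.read()                    # for 4GB dumps, mmap would be kinder

for m in pattern.finditer(data):
    sig = next(s for s in SIGNATURES if data.startswith(s, m.start()))
    print(f"possible {SIGNATURES[sig]} header at offset {m.start():#010x}")
```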
Related
I have some time-series data I'm looking at in Python that I know should follow a sine² function, but for various reasons doesn't quite fit it. I'm taking an FFT of it, and it has a fairly broad frequency spread when it should be a very narrow single frequency. However, the errors causing this are quite consistent: if I take the data again, it matches the previous data set very closely and gives a very similar FFT.
So I've been trying to come up with a way to rescale the time axis of the data so that it is at a single frequency, and then apply this same rescaling to future data I collect. I've tried various filtering techniques to smooth the data or to cut frequencies from the FFT, without much luck. I've also tried fitting a frequency-varying sine² to the data, but haven't been able to get a good fit (if I could, I would use the frequency-vs-time function to rescale the time axis of the original data so that it has a constant frequency, and then apply the same rescaling to any new data I collect).
Here's a small sample of the data I'm looking at (the full data goes on for a few hundred cycles), and the resulting FFT of the full data.
Any suggestions would be greatly appreciated. Thanks!
I have an application where 96% of the time is spent in 3D texture memory interpolation reads (red points in diagram).
My kernels are designed to do ~1000 memory reads along a line that crosses the texture memory arbitrarily, one thread per line (blue lines). These lines are densely packed, very close to each other, and travel in almost parallel directions.
The image shows the concept of what I am talking about. Imagine the image is a single "slice" from the 3D texture memory, e.g. z=24. The image is repeated for all z.
At the moment, I am executing the lines just one after the other, but I realized that I might be able to benefit from texture memory locality by running adjacent lines in the same block, reducing the time for memory reads.
My questions are:
If I have a 3D texture with linear interpolation, how can I benefit most from the data locality? By running adjacent lines in the same block in 2D, or adjacent lines in 3D (3D neighbors, or just neighbors per slice)?
How "big" is the cache (and how can I check this in the specs)? Does it load, e.g., the requested voxel and ±50 voxels around it in every direction? This directly relates to how many neighboring lines I'd put in each block!
How does the interpolation interact with the texture memory cache? Is the interpolation also performed on cached data, or does the fact that it is interpolated affect the memory latency because the interpolation needs to be done in the texture memory itself?
I am working on an NVIDIA TESLA K40 with CUDA 7.5, if that helps.
As this question is getting old, and no answers seem to exist for some of the things I asked, I will give a benchmark answer, based on my research building the TIGRE toolbox. You can get the source code in the Github repo.
As the answer is based on the specific application of the toolbox, computed tomography, my results are not necessarily true for all applications using texture memory. Additionally, my GPU (see above) is quite a decent one, so your mileage may vary on different hardware.
The specifics
It is important to note that this is a cone-beam computed tomography application. This means that:
The lines are more or less uniformly distributed across the image, covering most of it.
The lines are more or less parallel to adjacent lines and predominantly lie in a single plane, e.g. they are always more or less horizontal, never vertical.
The sample rate along the lines is the same, meaning that adjacent lines always take their next sample very close to each other.
All this information is important for memory locality.
Additionally, as said in the question, 96% of the kernel's time is memory reading, so it's safe to assume that the variations in the kernel times reported are due to changes in memory read speed.
The questions
If I have a 3D texture with linear interpolation, how can I benefit most from the data locality? By running adjacent lines in the same block in 2D, or adjacent lines in 3D (3D neighbors, or just neighbors per slice)?
Once one gets a bit more experienced with texture memory, one sees that the straightforward answer is: run as many adjacent lines together as possible. The closer the memory reads are to each other in image index, the better.
For tomography, this effectively means running square blocks of detector pixels, packing rays (the blue lines in the original image) together.
How "big" is the cache (or how can I check this in the specs)? Does it load e.g. the asked voxel and +-50 around it in every direction? This will directly relate with the amount of neighboring lines I'd put in each block!
While impossible to say, empirically I found that running smaller blocks is better. My results show that for a 512^3 image, with 512^2 rays, with a sample rate of ~2 samples/voxel, the block size:
32x32 -> [18~25] ms
16x16 -> [14~18] ms
8x8 -> [11~14] ms
4x4 -> [25~29] ms
The block sizes are effectively the sizes of the square groups of adjacent rays that are computed together; e.g., 32x32 means that 1024 X-rays are computed in parallel, adjacent to each other in a square 32x32 block. As the exact same operations are performed along each line, the samples are taken around a 32x32 plane in the image, covering approximately 32x32x1 indices.
Presumably, at some point, shrinking the blocks further would make the speed drop again, but that point is (at least for me) at a surprisingly small size. I think this hints that the memory cache loads relatively small chunks of data from the image.
These results also show something not asked in the original question: how out-of-bounds samples affect speed. As adding any if condition to the kernel would significantly slow it down, I programmed the kernel to start sampling at a point on the line that is guaranteed to be outside the image, and to stop in a similar fashion. This is done by creating a fictional "sphere" around the image and always taking the same number of samples, independent of the angle between the image and the lines themselves.
If you look at the times shown for each kernel, you'll notice that all of them span a range [t, ~sqrt(2)*t], and I have checked that the longer times indeed occur when the angle between the lines and the image is a multiple of 45 degrees, where more samples fall inside the image (texture).
This means that sampling outside the image index (tex3d(tex, -5, -5, -5)) is computationally free: no time is spent reading out of bounds. It is better to read a lot of out-of-bounds points than to check whether each point falls inside the image, as the if condition would slow the kernel down while sampling out of bounds has zero cost.
How does the interpolation interact with the texture memory cache? Is the interpolation also performed on cached data, or does the fact that it is interpolated affect the memory latency because the interpolation needs to be done in the texture memory itself?
To test this, I ran the same code with linear interpolation (cudaFilterModeLinear) and with nearest-neighbor interpolation (cudaFilterModePoint). As expected, there is a speed improvement when nearest-neighbor interpolation is used. For 8x8 blocks with the previously mentioned image sizes, on my PC:
Linear -> [11~14] ms
Nearest -> [ 9~10] ms
The speedup is not massive, but it is significant. This hints, as expected, that the time the texture hardware takes to interpolate the data is measurable, so one needs to be aware of it when designing applications.
I am trying to come up with an efficient way to characterize two narrowband tones separated by about 900 kHz (one at around 100 kHz and one at around 1 MHz once translated to baseband). They don't move much in frequency over time, but they may have amplitude variations we want to monitor.
Each tone is roughly 100 Hz wide, and we are required to characterize these two beasts over long periods of time down to a resolution of about 0.1 Hz. The samples are coming in at over 2 Msamples/s (TBD) to adequately acquire the higher tone.
I'm trying to avoid (if possible) doing brute-force >2-Msample FFTs on the data once a second to extract frequency-domain data. Is there an efficient approach? Something akin to performing two (much) smaller FFTs around the bands of interest? I've looked at the Goertzel and chirp-z methods, but I am not certain they would save processing.
Something akin to performing two (much) smaller FFTs around the bands of interest
There is: it's called Goertzel, it is kind of an FFT for single bins, and you have already looked at it. It will save you CPU time.
Anyway, there's no reason to do a 2M-point FFT; first of all, you only want a resolution of about 1/20 of the sampling rate, hence a 20-point FFT would totally do and should be pretty doable for your CPU at these low rates. Since you don't seem to care about the phase of your tones: FFT -> complex_to_mag.
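For reference, a minimal single-bin Goertzel sketch in Python (magnitude only, matching the complex_to_mag remark; the tone frequency and sample rate below are placeholders):

```python
import numpy as np

def goertzel_mag(x: np.ndarray, f_tone: float, fs: float) -> float:
    """Magnitude of the DFT bin nearest f_tone, via the Goertzel recurrence.

    Cost is O(N) per tone with one multiply per sample, instead of an
    O(N log N) FFT that computes every bin.
    """
    n = len(x)
    k = int(round(n * f_tone / fs))         # nearest bin index
    coeff = 2.0 * np.cos(2.0 * np.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for sample in x:
        s = sample + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    # bin power from the final two state variables
    power = s_prev**2 + s_prev2**2 - coeff * s_prev * s_prev2
    return float(np.sqrt(max(power, 0.0)))

# e.g. watch the ~100 kHz tone at fs = 2 Msps (both placeholder values)
fs = 2e6
t = np.arange(20000) / fs
x = np.cos(2 * np.pi * 100e3 * t)
print(goertzel_mag(x, f_tone=100e3, fs=fs))
```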
However, there's one thing you should always do: look at your signal of interest and decimate down to the rate that exactly fits it. Since GNU Radio's filters are implemented cleverly, the filter itself will only run at the decimated rate, and you can spend the saved CPU cycles on a better filter.
Because a direct decimation from 2 MHz to 100 Hz (decimation: 20000) would lead to a really ugly filter length, you should do this in multiple rate stages:
I'd first try decimating by 100, and then in a second step by 100 again, leaving you with 200 Hz of observable spectrum. The xlating FIR filter blocks will let you use a simple low-pass filter (use the "Low-Pass Filter Taps" block to define a variable that contains such taps) as a band selector.
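Outside GNU Radio, the same translate-then-decimate idea looks roughly like this (a NumPy/SciPy sketch with placeholder rates; GNU Radio's frequency-xlating FIR filter fuses the mixing, filtering and decimation into one block):

```python
import numpy as np
from scipy.signal import decimate

fs = 2e6            # input sample rate; placeholder, "TBD" in the question
f_tone = 100e3      # center of one band of interest; placeholder

t = np.arange(int(fs)) / fs                        # 1 s of synthetic input
x = np.cos(2 * np.pi * f_tone * t) + 0.01 * np.random.randn(t.size)

bb = x * np.exp(-2j * np.pi * f_tone * t)          # translate the band to 0 Hz

# Decimate in two stages of 100 (total 10000), leaving 200 Hz of spectrum;
# each stage's anti-alias FIR runs at the already-reduced rate.
for q in (100, 100):
    bb = decimate(bb, q, ftype="fir")

print(bb.size, "samples at", fs / 10000, "Hz")     # 200 samples at 200 Hz
```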
I have read the following sentence in a document discussing the drawbacks of the short-time Fourier transform; it says:
the drawback is that once you choose a particular size for the time
window, that window is the same for all frequencies
So what is the relation between frequency and the size of the window? If we have a high-frequency component in one part of a signal, why will we fail to detect it if the size of the window is not small/big enough?
Furthermore, it says about the wavelet transform:
Wavelet analysis allows the use of long time intervals where we want
more precise low-frequency information, and shorter regions where we
want high-frequency information
I feel that the answer is somehow related to the Nyquist rate.
For sampled data, the number of orthogonal sinusoidal Fourier basis vectors below half the sample rate increases with the length of the STFT window, and the bandwidth of each DFT/FFT result bin for each basis vector decreases. If the window is too short, then each DFT result bin might pick up not only your high-frequency component of interest, but also a greater bandwidth of adjacent frequencies.
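A tiny NumPy illustration of that trade-off, with two tones 2 Hz apart (all values chosen only for the demo):

```python
import numpy as np

fs = 1024.0                       # demo sample rate, chosen so bins line up
t = np.arange(4096) / fs
x = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 102 * t)  # 2 Hz apart

for n in (256, 4096):             # short vs. long analysis window
    mag = np.abs(np.fft.rfft(x[:n]))
    freqs = np.fft.rfftfreq(n, d=1 / fs)
    top2 = np.sort(freqs[np.argsort(mag)[-2:]])
    print(f"window={n:4d}  bin width={fs / n:.2f} Hz  two largest bins at {top2} Hz")

# With n=256 the bin width (4 Hz) exceeds the 2 Hz tone spacing, so the
# tones merge into one broad peak; with n=4096 (0.25 Hz bins) the spectrum
# shows two clean peaks at exactly 100.0 and 102.0 Hz.
```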
I want to set up a big matrix on my GPU to solve the corresponding equation system with CULA.
Some numbers for you, to understand the problem:
big matrix: 400x400
small matrices: 200x200
Now I want to copy each quarter (100x100) of the small matrices to a specific part of the big matrix.
I found two possible but obviously slow approaches: cublasSetMatrix and cublasGetMatrix support the specification of a leading dimension, so I could put the parts where I want them, but I would have to copy the matrix back to the host.
The other approach would be cudaMemcpy, which doesn't support leading dimensions. Here I could copy every single row/column by hand (at the moment I am unsure which of the two applies; the data comes from Fortran). But this way I would incur a big overhead...
Is there a better way than writing my own kernel to copy the matrix?
You may want to revise your question. I guess you are looking for a way that can both change the leading dimension and do a device-to-device copy.
There is a routine, cudaMemcpy2D(), that can do exactly that.
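For reference, cudaMemcpy2D(dst, dpitch, src, spitch, width, height, kind) copies height rows of width bytes each, advancing dst by dpitch bytes and src by spitch bytes per row, i.e., exactly a strided submatrix copy. Here is a NumPy sketch of the same indexing, assuming row-major double-precision data (the device pointer names in the comments are placeholders; with column-major Fortran data, swap the roles of rows and columns):

```python
import numpy as np

big = np.zeros((400, 400))          # the target system matrix
small = np.random.rand(200, 200)    # one of the small matrices

# Copy the top-left 100x100 quarter of `small` into `big` at (100, 200).
# The equivalent cudaMemcpy2D arguments would roughly be:
#   dst    = big_dev + (100 * 400 + 200) * sizeof(double)   # placeholder pointer
#   dpitch = 400 * sizeof(double)   # leading dimension of big, in bytes
#   src    = small_dev              # quarter starts at (0, 0); placeholder pointer
#   spitch = 200 * sizeof(double)   # leading dimension of small, in bytes
#   width  = 100 * sizeof(double)   # bytes actually copied per row
#   height = 100                    # number of rows
#   kind   = cudaMemcpyDeviceToDevice
big[100:200, 200:300] = small[0:100, 0:100]
print(big[100, 200] == small[0, 0])  # True
```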