why DCT transform is preferred over other transforms in video/image compression - fft

I went through how DCT (discrete cosine transform) is used in image and video compression standards.
But why DCT only is preferred over other transforms like dft or dst?

Because cos(0) is 1, the first (0th) coefficient of DCT-II is the mean of the values being transformed. This makes the first coefficient of each 8x8 block represent the average tone of its constituent pixels, which is obviously a good start. Subsequent coefficients add increasing levels of detail, starting with sweeping gradients and continuing into increasingly fiddly patterns, and it just so happens that the first few coefficients capture most of the signal in photographic images.
Sin(0) is 0, so the DSTs start with an offset of 0.5 or 1, and the first coefficient is a gentle mound rather than a flat plain. That is unlikely to suit ordinary images, and the result is that DSTs require more coefficients than DCTs to encode most blocks.
The DCT just happens to suit. That is really all there is to it.

When performing image compression, our best bet is to perform the KLT or the Karhunen–Loève transform as it results in the least possible mean square error between the original and the compressed image. However, KLT is dependent on the input image, which makes the compression process impractical.
DCT is the closest approximation to the KL Transform. Mostly we are interested in low frequency signals so only even component is necessary hence its computationally feasible to compute only DCT.
Also, the use of cosines rather than sine functions is critical for compression as fewer cosine functions are needed to approximate a typical signal (See Douglas Bagnall's answer for further explanation).
Another advantage of using cosines is the lack of discontinuities. In DFT, since the signal is represented periodically, when truncating representation coefficients, the signal will tend to "lose its form". In DCT, however, due to the continuous periodic structure, the signal can withstand relatively more coefficient truncation but still keep the desired shape.

The DCT of a image macroblock where the top and bottom and/or the left and right edges don't match will have less energy in the higher frequency coefficients than a DFT. Thus allowing greater opportunities for these high coefficients to be removed, more coarsely quantized or compressed, without creating more visible macroblock boundary artifacts.

DCT is preferred over DFT (Discrete Fourier Transformation) and KLT (Karhunen-Loeve Transformation)
1. Fast algorithm
2. Good energy compaction
3. Only real coefficients

Related

Why W_q matrix in torch.nn.MultiheadAttention is quadratic

I am trying to implement nn.MultiheadAttention in my network. According to the docs,
embed_dim  – total dimension of the model.
However, according to the source file,
embed_dim must be divisible by num_heads
and
self.q_proj_weight = Parameter(torch.Tensor(embed_dim, embed_dim))
If I understand properly, this means each head takes only a part of features of each query, as the matrix is quadratic. Is it a bug of realization or is my understanding wrong?
Each head uses a different part of the projected query vector. You can imagine it as if the query gets split into num_heads vectors that are independently used to compute the scaled dot-product attention. So, each head operates on a different linear combination of the features in queries (and keys and values, too). This linear projection is done using the self.q_proj_weight matrix and the projected queries are passed to F.multi_head_attention_forward function.
In F.multi_head_attention_forward, it is implemented by reshaping and transposing the query vector, so that the independent attentions for individual heads can be computed efficiently by matrix multiplication.
The attention head sizes are a design decision of PyTorch. In theory, you could have a different head size, so the projection matrix would have a shape of embedding_dim × num_heads * head_dims. Some implementations of transformers (such as C++-based Marian for machine translation, or Huggingface's Transformers) allow that.

fft output show unexpected symmetry

I am running a cfft on a signal. The output seems to show symmetry. I know that
an fft is symmetrical, but the code
arm_cfft_f32(&arm_cfft_sR_f32_len512, &FFTBuf[0], 0, 1);
arm_cmplx_mag_f32(&FFTBuf[0], &FFTMagBuf[0], FFT_LEN);
accounts for this as the FFTMagBuf is Half the length of the Input array.
The output though, still appears to show symmetry
[1]https://imgur.com/K0uMDAm
arrows point to my whistle, which shows nicely, surrounded by much noise.
the middle one is probably a harmonic(my whistling is crap). but left right symmetry is noticeable.
I am using an stm32f4 disco board, and the samples are from the on-board mems microphone, and each block of samples(in this case 1024, to give an fft of 512 length) is passed through a hann window.
I am using a modified version of tony dicola's spectrogramui.py for visualization.
According to the documentation arm_cmplx_mag_f32 computes the magnitude of a complex signal. That's why FFTMagBuf has to be half the size of FFTBuf: both arrays hold real numbers but, the complex samples are made of two reals. It's unrelated to the simmetry of the FFT.
So, the output signal has exactly the same number of samples as the input.
That is, you compute the complex FFT of a real signal, which has some kind of symmetry (you need to account for complex conjugation too), and you take the magnitude, which is symmetric. Of course, the plot is then symmetric too.

DM Script, why does the fourier transform of gaussian-kenel needs modulus

Recently I learn DM_Script for TEM image processing
I needed Gaussian blur process and I found one whose name is 'Gaussian Blur' in http://www.dmscripting.com/recent_updates.html
This code implements Gaussian blur algorithm by multiplying the fast fourier transform(FFT) of source image by the FFT of Gaussian-kernel image and finally doing inverse fourier transform of it.
Here is the part of the code,
// Carry out the convolution in Fourier space
compleximage fftkernelimg:=realFFT(kernelimg) (-> FFT of Gaussian-kernel image)
compleximage FFTSource:=realfft(warpimg) (-> FFT of source image)
compleximage FFTProduct:=FFTSource*fftkernelimg.modulus().sqrt()
realimage invFFT:=realIFFT(FFTProduct)
The point I want to ask is this
compleximage FFTProduct:=FFTSource*fftkernelimg.modulus().sqrt()
Why does the FFT of Gaussian-kernel need '.modulus().sqrt()' for the convolution?
It is related to the fact that the fourier transform of a Gaussian function becomes another Gaussian function?
Or It is related to a sort of limitation of discrete fourier transform?
Please answer me
Thanks
This is related to the general precision limitation of any floating point numeric computing. (see f.e. here, or more in depth here)
A rotational (real-valued) Gaussian of stand.dev. sigma should be transformed into a 100% real-values rotational Gaussioan of 1/sigma. However, doing this numerically will show you deviations: Just try the following:
number sigma = 30
number A0 = 1
realimage first := RealImage( "First", 8, 256, 256 )
first = A0 * exp( - (iradius**2/(2*sigma*sigma) ))
first.showimage()
complexImage second := FFT(first)
second.Showimage()
image nonZeroImaginaryMask = ( 0 != second.Imaginary() )
nonZeroImaginaryMask.Showimage()
nonZeroImaginaryMask.SetLimits(0,1)
When you then multiply these complex images (before back-transferring) you are introducing even more errors. By using modulus, one ensures that the forward transformed kernel is purely real and hence a better "damping" curve.
A better implementation of a FFT filtering code would actually create the FFT(Gaussian) directly with a std.dev of 1/sigma, as this is the analytically correct result. Doing a FFT of the kernel only makes sense if the kernel (or its FFT) is not analytically known.
In general: When implementing any "maths" into a program code, it can pay hugely to think it through with numerical computation limits in the back of your head. Reduce actual computation whenever possible (i.e. compute analytically and use the result instead of relying on brute force numerical computation) and try to "reshape" equations when possible, f.e. avoid large sums over many small numbers, be careful about checks against exact numeric values, try to avoid expressions which are very sensitive on small numerica errors etc.

Motion Vectors and DCT residuals, are they related or independent?

I am working on a novel technique that uses already encoded H264 motion vectors from a pre-encoded video.
I need to know how the motion vectors and residuals are related. I need some very specific answers that I can't find answered anywhere else:
Are the motion vectors forward, or backward? I mean, does the vector indicate where the current 4x4 or 8x8, 8x4 .... block will be in the next frame (forward). Or is it the opposite? (That in the block it is indicated where that block comes from), (backwards).
In the case a block has multiple references (I don't know if that is even possible). How are those references added together? Mean? Weighted?
How is the residual error being compensated, per block (4x8, 8x4, etc)? Ignoring the sub blocks, and just partitioning the image in 8x8 chunks?
My ultimate goal, is to know from the video feed the "accuracy" of each motion vector. I can only do that with backwards prediction, and if the DCT residuals are per block. In that case I can measure the accuracy of the motion vector estimation by measuring the amount of residual error of that block.
Thanks in advance!!
PD: Reading trough the 800 pages of H264 is not easy task....
The H264 standard is your friend. Also get the books by Ian Richardson, a bit more readable than the standard (but only a bit :)
"Are the motion vectors forward, or backward?" - they are backward. The MV for a block points to where that block came from.
"In the case a block has multiple references (I don't know if that is even possible). How are those references added together?" - it is possible, check out weightb and weightp options for x264. Can have up to two references, the explicit weights are encoded in the stream (I think as deltas from the neighbor weights, so usually zeros - but don't quote me on that; also I think whether weights are used is a flag somewhere, if not used the weights are equal by default)
"How is the residual error being compensated" - depends on the macroblock partitioning mode and transform size. The MVs are for each partition, the residuals are for the transform size tiled into the partition (so if a 16x16 is partitioned into two 16x8 and the transform is 8x8, each partition gets two transforms; if the transform is 4x4 each partition gets (16/4)x(8/4)=8 transforms).
For experiments, you can change encoder settings to turn off B-frames and weighted P-frames, and also restrict the partitioning mode to not partition (ie 16x16 only). This allows much easier way to try different motion vectors :)

How to represent stereo audio data for FFT

How should stereo (2 channel) audio data be represented for FFT? Do you
A. Take the average of the two channels and assign it to the real component of a number and leave the imaginary component 0.
B. Assign one channel to the real component and the other channel to the imag component.
Is there a reason to do one or the other? I searched the web but could not find any definite answers on this.
I'm doing some simple spectrum analysis and, not knowing any better, used option A). This gave me an unexpected result, whereas option B) went as expected. Here are some more details:
I have a WAV file of a piano "middle-C". By definition, middle-C is 260Hz, so I would expect the peak frequency to be at 260Hz and smaller peaks at harmonics. I confirmed this by viewing the spectrum via an audio editing software (Sound Forge). But when I took the FFT myself, with option A), the peak was at 520Hz. With option B), the peak was at 260Hz.
Am I missing something? The explanation that I came up with so far is that representing stereo data using a real and imag component implies that the two channels are independent, which, I suppose they're not, and hence the mess-up.
I don't think you're taking the average correctly. :-)
C. Process each channel separately, assigning the amplitude to the real component and leaving the imaginary component as 0.
Option B does not make sense. Option A, which amounts to convert the signal to mono, is OK (if you are interested in a global spectrum).
Your problem (double freq) is surely related to some misunderstanding in the use of your FFT routines.
Once you take the FFT you need to get the Magnitude of the complex frequency spectrum. To get the magnitude you take the absolute of the complex spectrum |X(w)|. If you want to look at the power spectrum you square the magnitude spectrum, |X(w)|^2.
In terms of your frequency shift I think it has to do with you setting the imaginary parts to zero.
If you imagine the complex Frequency spectrum as a series of complex vectors or position vectors in a cartesian space. If you took one discrete frequency bin X(w), there would be one real component representing its direction in the real axis (x -direction), and one imaginary component in the in the imaginary axis (y - direction). There are four important values about this discrete frequency, 1. real value, 2. imaginary value, 3. Magnitude and, 4. phase. If you just take the real value and set imaginary to 0, you are setting Magnitude = real and phase = 0deg or 90deg. You have hence forth modified the resulting spectrum, and applied a bias to every frequency bin. Take a look at the wiki on Magnitude of a vector, also called the Euclidean norm of a vector to brush up on your understanding. Leonbloy was correct, but I hope this was more informative.
Think of the FFT as a way to get information from a single signal. What you are asking is what is the best way to display data from two signals. My answer would be to treat each independently, and display an FFT for each.
If you want a really fast streaming FFT you can read about an algorithm I wrote here: www.depthcharged.us/?p=176