PSNR calculation in H.264 packet loss - h.264

I'm streaming H.264(libx264) videos using RTP. I would like to compare the different error resiliency methods in h.264. In some research papers, psnr was used to compare them. So I would like to know how to calculate the psnr of the h.264 video during streaming.

To calculate PSNR you must compare two frames, so the first step is to make sure you have a copy of the source video. Next, you must be able to match the frames 1:1, so if a frame is dropped you need to compare the source frame against the previously streamed frame. This may be difficult if the timestamps do not match (they may be modified by the RTP server). Then decode each frame into its YUV planes; the PSNR of each channel needs to be calculated independently. You can average the three PSNR values at the end, but this puts too much weight on the U and V channels. I recommend using just the Y channel, as it is the most important one, and since you are measuring packet loss the values will be strongly correlated anyway.
Next calculate your mean squared error like so:
double mse = 0.0;
for (int x = 0; x < frame.width; ++x) {
    for (int y = 0; y < frame.height; ++y) {
        double diff = source_frame[x][y] - streamed_frame[x][y];
        mse += diff * diff; // accumulate squared error per pixel
    }
}
mse /= double(frame.width) * frame.height; // keep the fractional part; integer math would truncate it
And finally:
/* use 219 (235 - 16) instead of 255 if your video uses limited/studio-range levels */
double psnr = 20 * log10( 255 ) - 10 * log10( mse );
You can average the per frame PSNRs together to get a full stream PSNR.
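For reference, the per-frame calculation and the stream average might look like the following Python/NumPy sketch; decode_y_planes is a hypothetical helper standing in for whatever decoder you use to produce matched 8-bit luma planes:

import numpy as np

def psnr_y(source_y, streamed_y, peak=255.0):
    """PSNR of one pair of matched 8-bit luma planes."""
    diff = source_y.astype(np.float64) - streamed_y.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 20 * np.log10(peak) - 10 * np.log10(mse)

# decode_y_planes() is a hypothetical helper yielding aligned luma planes per frame
frame_psnrs = [psnr_y(src, rcv)
               for src, rcv in zip(decode_y_planes("source.264"),
                                   decode_y_planes("streamed.264"))]
stream_psnr = np.mean([p for p in frame_psnrs if np.isfinite(p)])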

Related

Understanding a WAV file exported from a DAW

I have generated a tone from Audacity at 440 Hz with an amplitude of 1 for 1 sec, like this:
I understand that this is going to create 440 peaks in 1 sec with an amplitude of 1.
Here I see that it's a 32-bit file and 44100 Hz is the sample rate, which means there are 44100 samples per second. The amplitude is 1, which is as expected because that is what I chose.
What I don't understand is: what is the unit of this amplitude? When right-clicked it shows linear (-1 to +1).
There is an option to select dB, which shows (0 to -60 to 0), and I don't understand how this is converted!
Now when I use this wav file in Python with scipy to read the wav and get values of time and amplitude:
How do I match, or get the relation between, what I generated vs what I see when I read the wav file?
The peak amplitude is 32767.987724003342 and the frequency is 439.99002267573695.
The code I have used in Python is:
from scipy.io import wavfile
import numpy as np

wavFileName = "440Hz.wav"
sample_rate, sample_data = wavfile.read(wavFileName)
print("Sample Rate or Sampling Frequency is", sample_rate, " Hz")
l_audio = len(sample_data.shape)
print("Channels", l_audio, "Audio data shape", sample_data.shape, "l_audio", l_audio)
if l_audio == 2:
    sample_data = sample_data.sum(axis=1) / 2
N = sample_data.shape[0]
length = N / sample_rate
print("Duration of audio wav file in secs", length, "Number of Samples chosen", sample_data.shape[0])
time = np.linspace(0, length, sample_data.shape[0])
sampling_interval = time[1] - time[0]
Notice that in Audacity, when you created the one second of audio with an amplitude choice of 1.0, right before saving the file it says signed 16-bit integer. An amplitude from -1 to +1 means the WAV file, in PCM format, stores your raw audio as signed integers varying from the maximum negative to the maximum positive value. Since 2^16 is 65536, the signed 16-bit integer range is -32768 to 32767, in other words from -2^15 to (+2^15 - 1). To plot this more easily I suggest you choose a time period much shorter than one second, say 0.1 seconds; once you're OK with that, boost it back up to the full one second, which is hard to visualize on a plot due to the 44100 samples.
import os
import scipy.io
import scipy.io.wavfile
import numpy as np
import matplotlib.pyplot as plt
myAudioFilename = '/home/olof/sine_wave_440_Hz.wav'
samplerate, audio_buffer = scipy.io.wavfile.read(myAudioFilename)
duration = len(audio_buffer)/samplerate
time = np.arange(len(audio_buffer)) / samplerate # time vector, one entry per sample
plt.plot(time,audio_buffer)
plt.xlabel('Time [s]')
plt.ylabel('Amplitude')
plt.title(myAudioFilename)
plt.show()
Here is 0.1 seconds of 440 Hz using signed 16-bit; notice the Y-axis amplitude range matches the minimum-to-maximum signed-integer range mentioned above.
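To tie the integer samples back to the linear -1 to +1 scale Audacity shows, a minimal sketch (assuming the same 16-bit 440Hz.wav as in the question) simply divides by 2^15:

import numpy as np
from scipy.io import wavfile

sample_rate, samples = wavfile.read("440Hz.wav")    # int16 samples for a 16-bit PCM file
normalized = samples.astype(np.float32) / 32768.0   # now in the -1.0 ... +1.0 range
print(normalized.max())   # close to the amplitude of 1.0 chosen in the tone generator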

Convert FFT to PCM

I have some FFT data, 257 dimensions, every 10 ms, with 121 frames, i.e. 1.21 secs. I think the first dimension is probably something else and the remaining are the FFT coefficients, I guess.
It's probably just spectrogram data. From a comment about the FFT data, sqrt10 and mean-variance normalization might have been applied to it.
From there, I want to calculate back some PCM signal at 44.1 kHz so I can play the sound. I asked the same question in a more mathematical way here, but maybe StackOverflow is a better place because I actually want to implement this.
I also asked the same question about the theory here on DSP SE.
How would I do that? Maybe I need some more information (which I have to find out somehow) - which? Maybe this missing information can be guessed intelligently somehow?
This question is about both the theory and the practical implementation. The implementation is trivial, I guess. But a concrete example in some language would be nice to help understand the theory. Maybe C++ with FFTW? I skimmed through the FFTW docs but fail to understand all the terminology and some background, e.g. here. Why is it from complex to real or the other way around, when I only want real to real? What are those REDFT? What's a DCT, DFT, DST? FFTW_HC2R?
I read all the FFT data, i.e. 121 * 257 floats, into a vector freq_bins.
#include <cmath>
#include <vector>
#include <fftw3.h>

std::vector<float32_t> freq_bins; // FFT data
int freq_bins_count = 257;
size_t len = 121;
std::vector<float32_t> pcm; // output, PCM data
int N = freq_bins_count;
std::vector<double> out(N), orig_in(N);
// float32_t, sample_t, offset, scale, dt and sampleRate are defined elsewhere in my code
// inspiration: https://stackoverflow.com/questions/2459295/invertible-stft-and-istft-in-python/6891772#6891772
for (size_t f = 0; f < len; ++f) {
    size_t pos = freq_bins_count * f;
    for (int i = 0; i < N; ++i)
        out[i] = pow(freq_bins[pos + i] + offset, 10); // fft was sqrt10 + mvn
    fftw_plan q = fftw_plan_r2r_1d(N, &out[0], &orig_in[0], FFTW_REDFT00, FFTW_ESTIMATE);
    fftw_execute(q);
    fftw_destroy_plan(q);
    // naive overlap-and-add
    auto start_frame = size_t(f * dt * sampleRate);
    for (int i = 0; i < N; ++i) {
        sample_t frame = orig_in[i] * scale / (2 * (N - 1));
        size_t idx = start_frame + i;
        while (idx >= pcm.size())
            pcm.push_back(0);
        pcm[idx] += frame;
    }
}
But this is wrong, I guess. I just get garbage out.
Related might be this question. Or this.
If the data you have is real then it is most probably spectrogram data, and if the data you are receiving is complex then you most probably have raw short-time Fourier transform (STFT) data (see the diagram in this post for how STFT/spectrogram data is produced). Spectrogram data is produced by taking the magnitude squared of STFT data and is thus not invertible, because all the phase information in the audio signal has been lost. Raw STFT data is invertible, so if that is what you have you might want to look for a library that performs the inverse STFT and try using that.
As for the question of what the FFT dimensions in your data represent, I reckon the 257 data points you are receiving every 10 ms are the result of a 512-point FFT being used in the STFT process. The first sample is the 0 Hz frequency and the remaining 256 data points are one half of the FFT spectrum (the other half has been discarded because the input to the FFT is real, so one half of the FFT data is simply the complex conjugate of the other half).
In addition, I would like to point out that just because you are receiving FFT data every 10 ms, 121 times, does not mean the audio signal is 1.21 s long. The STFT is usually produced using overlapping windows, so your audio signal might be shorter than 1.21 s.
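If the values turn out to be complex STFT bins, a minimal sketch of the inverse STFT with SciPy follows; the 512-point FFT, 10 ms hop, Hann window, and 44.1 kHz rate are assumptions taken from the discussion above, not known properties of the data:

import numpy as np
from scipy.signal import istft

fs = 44100                 # assumed output sample rate
nperseg = 512              # assumed FFT size (257 = 512/2 + 1 one-sided bins)
hop = int(0.010 * fs)      # assumed 10 ms hop between frames

# Zxx holds the complex one-sided STFT, shape (freq bins, frames) = (257, 121).
# A zero array stands in here for the real data.
Zxx = np.zeros((257, 121), dtype=complex)

_, pcm = istft(Zxx, fs=fs, nperseg=nperseg,
               noverlap=nperseg - hop, input_onesided=True)

If only magnitudes (a spectrogram) are available, the phase has to be estimated instead; Griffin-Lim (for example librosa.griffinlim) is the usual approach.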
You'd simply push the data you have through the inverse Fourier transform. All FFT libraries offer forward and backward transformation functions.

How to "zero-phase-adjust" DFT output?

I understand the complex output of a DFT contains both "amplitude" and "phase" information at discrete frequencies.
Amplitude[n] = sqrt((r[n]*r[n]) + (i[n]*i[n]))
Phase[n] = (atan2(i[n],r[n]))
Frequency[n] = n * (sample_rate / fft_input_length)
It seems that I should be able to use the frequency, amplitude, and phase information to calculate the amplitude of each output bin as if the input at the corresponding frequency had a zero-phase alignment in the FFT input. But I am drawing a blank.
Hmm, digging deeper into my problem I discovered that the imaginary portion of the FFT output is always 0.0 regardless of the input. So I am guessing my code is flawed or the algorithm is not what I need.
If you want to rotate all DFT result bins to a phase of zero with reference to the start (sample 0): set r[n] = amplitude[n], i[n] = 0; make sure r[n] is symmetric over the full DFT length if you want strictly real data; and compute the IDFT if needed.
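A minimal NumPy sketch of that recipe, using np.fft.rfft so the conjugate symmetry is handled implicitly (the short sine input is just an illustrative assumption):

import numpy as np

n = 256
x = np.sin(2 * np.pi * 5 * np.arange(n) / n)   # example input: 5 cycles of a sine

spectrum = np.fft.rfft(x)             # complex bins of a real signal
zero_phase = np.abs(spectrum)         # r[n] = amplitude[n], i[n] = 0
y = np.fft.irfft(zero_phase, n=n)     # back to a strictly real signal;
                                      # irfft enforces the required symmetry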

Any rules of thumb for how to smooth an FFT spectrum to prevent artifacts when hand-tweaking?

I've got an FFT magnitude spectrum and I want to create a filter from it that selectively passes periodic noise sources (e.g. sine-wave spurs) and zeroes out the frequency bins associated with the random background noise. I understand sharp transitions in the frequency domain will create ringing artifacts once this filter is IFFT'd back to the time domain... and so I'm wondering if there are any rules of thumb for how to smooth the transitions in such a filter to avoid such ringing.
For example, if the FFT has 1M frequency bins, and there are five spurs poking out of the background noise floor, I'd like to zero all bins except the peak bin associated with each of the five spurs. The question is how to handle the neighboring spur bins to prevent artifacts in the time domain. For example, should the bin on each side of a spur bin be set to 50% amplitude? Should two bins on either side of a spur bin be used (the closest one at 50%, and the next closest at 25%, etc.)? Any thoughts greatly appreciated. Thanks!
I like the following method:
Create the ideal magnitude spectrum (remembering to make it symmetrical about DC)
Inverse transform to the time domain
Rotate the block by half the blocksize
Apply a Hann window
I find it creates reasonably smooth frequency domain results, although I've never tried it on something as sharp as you're suggesting. You can probably make a sharper filter by using a Kaiser-Bessel window, but you have to pick the parameters appropriately. By sharper, I'm guessing maybe you can reduce the sidelobes by 6 dB or so.
Here's some sample Matlab/Octave code. To test the results, I used freqz(h, 1, length(h)*10);.
function [ht, htrot, htwin] = ArbBandPass(N, freqs)
%# N = desired filter length
%# freqs = array of frequencies, normalized by pi, to turn into passbands
%# returns raw, rotated, and rotated+windowed coeffs in time domain
if any(freqs >= 1) || any(freqs <= 0)
    error('0 < passband frequency < 1.0 required to fit within (DC,pi)')
end
hf = zeros(N,1); %# magnitude spectrum from DC to 2*pi is initialized to 0
%# In Matlab's FFT, idx 1 -> DC, idx 2 -> bin 1, idx N/2 -> Fs/2 - 1, idx N/2 + 1 -> Fs/2, idx N -> bin -1
idxs = round(freqs * N/2)+1; %# indices of passband freqs between DC and pi
hf(idxs) = 1; %# set desired positive frequencies to 1
hf(N - (idxs-2)) = 1; %# make sure 2-sided spectrum is symmetric, guarantees real filter coeffs in time domain
ht = ifft(hf); %# this will have a small imaginary part due to numerical error
if any(abs(imag(ht)) > 2*eps(max(abs(real(ht)))))
    warning('Imaginary part of time domain signal surprisingly large - is the spectrum symmetric?')
end
ht = real(ht); %# discard tiny imag part from numerical error
htrot = [ht((N/2 + 1):end) ; ht(1:(N/2))]; %# circularly rotate time domain block by N/2 points
win = hann(N, 'periodic'); %# might want to use a window with a flatter mainlobe
htwin = htrot .* win;
htwin = htwin .* (N/sum(win)); %# normalize peak amplitude by compensating for width of window lineshape

How do I play back a WAV in ActionScript?

Please see the class I have created at http://textsnip.com/see/WAVinAS3 for parsing a WAVE file in ActionScript 3.0.
This class is correctly pulling apart info from the file header & fmt chunks, isolating the data chunk, and creating a new ByteArray to store the data chunk. It takes in an uncompressed WAVE file with a format tag of 1. The WAVE file is embedded into my SWF with the following Flex embed tag:
[Embed(source="some_sound.wav", mimeType="application/octet-stream")]
public var sound_class:Class;
public var wave:WaveFile = new WaveFile(new sound_class());
After the data chunk is separated, the class attempts to make a Sound object that can stream the samples from the data chunk. I'm having issues with the streaming process, probably because I'm not good at math and don't really know what's happening with the bits/bytes, etc.
Here are the two documents I'm using as a reference for the WAVE file format:
http://www.lightlink.com/tjweber/StripWav/Canon.html
https://ccrma.stanford.edu/courses/422/projects/WaveFormat/
Right now, the file IS playing back! In real time, even! But...the sound is really distorted. What's going on?
The problem is in the onSampleData handler.
In your wav file, the amplitudes are stored as signed shorts, that is, 16-bit integers. You are reading them as 32-bit floats. Integers and floats are represented differently in binary, so that will never work right.
Now, the player expects floats. Why did they use floats? I don't know for sure, but one good reason is that it allows the player to accept a normalized value for each sample. That way you don't have to care or know what bit depth the player is using: the max value is 1, the min value is -1, and that's it.
So, your problem is that you have to convert your signed short to a normalized signed float. A short takes 16 bits, so it can store 2 ^ 16 (or 65,536) different values. Since it's signed and the sign takes up one bit, the magnitude is bounded by 2 ^ 15. So, you know your input is in the range -32,768 ... 32,767.
The sample value, on the other hand, is normalized and must be in the range -1 ... 1.
So, you have to normalize your input. It's quite easy: just take the value you read and divide it by the max value, and you have your input amplitude converted to the range -1 ... 1.
Something like this:
private function onSampleData(evt:SampleDataEvent):void
{
    var amplitude:int = 0;
    var maxAmplitude:int = 1 << (bitsPerSample - 1); // or Math.pow(2, bitsPerSample - 1);
    var sample:Number = 0;
    var actualSamples:int = 8192;
    var samplesPerChannel:int = actualSamples / channels;
    for ( var c:int = 0; c < samplesPerChannel ; c++ ) {
        var i:int = 0;
        while(i < channels && data.bytesAvailable >= 2) {
            amplitude = data.readShort();
            sample = amplitude / maxAmplitude;
            evt.data.writeFloat(sample);
            i++;
        }
    }
}
A couple of things to note:
maxAmplitude could (and probably should) be calculated when you read the bit depth. I'm doing it in the method just so you can see it in the pasted code.

Although maxAmplitude is calculated based on the read bit depth, and thus will be correct for any bit depth, I'm reading shorts in the loop, so if your wav file happens to use a different bit depth this function will not work correctly. You could add a switch and read the necessary amount of data (i.e., readInt if the bit depth is 32). However, 16 bits is such a widely used standard that I doubt this is practically needed.

This function will work for stereo wavs. If you want it to work for mono, rewrite it to write the same sample twice. That is, for each read you do two writes (your input is mono, but the player expects 2 samples).

I removed the EOF catch, as you can know whether you have enough data to read from your buffer by checking bytesAvailable. Reaching the end of the stream is not exceptional in any way, IMO, so I'd rather handle that case without an exception handler, but this is just a personal preference.