Why do we transmit the IBP frames out of order? - h.264

We transmit the I, B, and P frames in an order like IPBBPBB and then display them as IBBPBBP. The question is: why do we do that? I can't quite visualize it in my head. I mean, why not just transmit them in the order they are to be displayed?

With bi-directional frames in temporal compression, the decoding order (the order in which the data needs to be transmitted for sequential decoding) is different from the presentation order. This explains the effect you are referring to.
In a typical GOP, you need the data for frame P2 to decode frame B1, so when it comes to transmission, P2 goes ahead.
See more on this: B-Frames in DirectShow
With MPEG-2, a new frame type was introduced - the bi-directionally predicted frame, or B frame. As the name suggests, the frame is derived from at least two other frames - one from the past and one from the future (Figure 2).
Since the B1 frame is derived from I0 and P2 frames both of them must be available to the decoder prior to the start of the decoding of B1. This means the transmission/decoding order is not the same as the presentation order. That’s why there are two types of time stamps for video frames - PTS (presentation time stamp) and DTS (decoding time stamp).

Briefly: it's done to speed up the decoder. You've mentioned the usual MPEG-2 GOP (Group of Pictures), so I'll explain the answer for MPEG-2, though H.264 uses exactly the same logic.
For the coder, the picture difference is smaller when it is calculated using not only previous frames but successive frames too. That's why the coder (usually) processes frames in display order. So, in an IBBPBBP GOP, every B frame can use the previous I frame and the next P frame to make its prediction.
For the decoder, it's better when every successive frame uses only previously received frames for prediction. That's why the pictures in the bitstream are reordered. In the reordered group IPBBPBB, every B frame uses an I frame and a P frame that have both already arrived, and that's faster.
Also, every frame has its own PTS (presentation timestamp), which implicitly determines display order - so the reordering is no big deal.
This Wikipedia article can give you an answer to your question.
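To make the reordering concrete, here is a small illustrative sketch (TypeScript, with made-up Frame/GOP structures rather than any real codec API; the rule "hold B frames until their future reference arrives" is a simplification) that converts display order into decode order and assigns PTS/DTS indices:

// Illustrative only: reorder a display-order GOP into decode order.
type FrameType = "I" | "P" | "B";

interface Frame {
  type: FrameType;
  pts: number;   // presentation-order index
  dts?: number;  // decode/transmission-order index, filled in below
}

function toDecodeOrder(displayOrder: Frame[]): Frame[] {
  const decodeOrder: Frame[] = [];
  let pendingB: Frame[] = [];

  for (const frame of displayOrder) {
    if (frame.type === "B") {
      // A B frame needs its future reference, so hold it back.
      pendingB.push(frame);
    } else {
      // An I or P frame is a reference: transmit it before the
      // B frames that were waiting for it.
      decodeOrder.push(frame, ...pendingB);
      pendingB = [];
    }
  }
  decodeOrder.push(...pendingB); // trailing B frames, if any

  decodeOrder.forEach((f, i) => (f.dts = i));
  return decodeOrder;
}

// Display order I B B P B B P  ->  decode order I P B B P B B
const gop: Frame[] = (["I", "B", "B", "P", "B", "B", "P"] as FrameType[])
  .map((t, i) => ({ type: t, pts: i }));
console.log(toDecodeOrder(gop)
  .map(f => `${f.type}(pts=${f.pts},dts=${f.dts})`).join(" "));

Running the sketch prints I(pts=0,dts=0) P(pts=3,dts=1) B(pts=1,dts=2) B(pts=2,dts=3) ..., i.e. exactly the IPBBPBB transmission order from the question, with PTS recording where each frame belongs on screen.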

Related

Data model for timeline event synchronisation

I am looking for ideas on the data model for the following problem (and the proper CS terminology):
A (horizontal) "timeline" with several rows (A,B,C) contains "events" (1,2,3) with different durations (width) at different times (absolute x position, or by a delay "." after the previous event):
A 1111....222222
B 33333
------------------
T 0123456789ABCDEF
(The rows are only interesting for graphical representation of overlapping/parallel "events", so they probably are not essential to the data model.)
Event duration may vary, affecting the whole timing:
A 11....222222
B 33333+3
------------------
T 0123456789ABCDEF
But let event 2 require events 1 and 3 to be finished, so the timing should look like this:
A 11.... 222222
B 33333+3
------------------
T 0123456789ABCDEF
(let's ignore that the original delay at T=7 is now missing.)
Originally I thought I'd have to have some "elastic" synchronization elements, one for each row:
A 11....####222222
B 33333+3#
------------------
T 0123456789ABCDEF
Hence the original problem of how to model and synchronize the sync elements in the two different "rows". But, as established above, this is only a matter of graphical/parallel representation.
Rather, the sync is a condition that could be "attached" to event 2, modifying or determining its beginning.
If an event "has" a condition, it will not have an absolute or relative start time. Its start can only be determined by the ends of the "linked" events (1 and 3).
So, given (a list of) some events with variable duration and either an absolute start time or a delay relative to another event's end, how could the condition "events 1 and 3 ended" be modelled to determine the start of "event 2"?
(I will prototype this in JavaScript and eventually implement in C/C++, so any sample code provided should not use high-level data types or libraries.)
What you need is an object that I would call a TimeFrame. The object would have the attributes duration, link and type, where link can be a precise time or a link to another TimeFrame and type accounts for the kind of link. For instance, a given TimeFrame that starts at a known time would have that time as its link attribute and the type would be TIME. A TimeFrame that is linked to the end of another would have that other TimeFrame as its link attribute and START-END as its type and so on.
Using the combination between link and type you could also support other types of links such as START-START, END-START or END-END.
UPDATE
Also, in order to allow some time interval between say, the end of a TimeFrame and the start of the next, one can add the attribute lag, which represents any delay between events. So, for instance if tf1 and tf2 are TimeFrames such that tf2 must start 5 time units after the end of tf1 the attributes of tf2 would be link = tf1, type = START-END, duration = <something> and lag = 5. Note also that the lag could be negative, which would extend the expressiveness of the model to a broad range of relationships.
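A minimal sketch of this idea, in TypeScript (the asker mentioned prototyping in JavaScript); TimeFrame, LinkType, startOf and endOf are names I made up for illustration, not an established API:

// Sketch of the TimeFrame idea: a frame either starts at an absolute
// time (type TIME) or is anchored to another frame's start or end,
// optionally shifted by a lag (which may be negative).
type LinkType = "TIME" | "START_START" | "START_END";

interface TimeFrame {
  duration: number;
  type: LinkType;
  link: number | TimeFrame;  // absolute time, or the frame we depend on
  lag: number;               // delay relative to the anchor point
}

function startOf(tf: TimeFrame): number {
  if (tf.type === "TIME") return (tf.link as number) + tf.lag;
  const other = tf.link as TimeFrame;
  const anchor = tf.type === "START_START" ? startOf(other) : endOf(other);
  return anchor + tf.lag;
}

function endOf(tf: TimeFrame): number {
  return startOf(tf) + tf.duration;
}

// Event 1 starts at t=0 and lasts 2; event 2 starts 4 units after event 1 ends.
const e1: TimeFrame = { duration: 2, type: "TIME", link: 0, lag: 0 };
const e2: TimeFrame = { duration: 6, type: "START_END", link: e1, lag: 4 };
console.log(startOf(e2), endOf(e2)); // 6 12 - matches row A of the "11....222222" diagram

The END-START and END-END variants would work the same way, except that the computed anchor would position the frame's end rather than its start.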
While @Leandro Caniglia nicely rephrased my question in terms of an object and its attributes, essentially, I see two options:
the whole list of "events" needs to be evaluated at "condition" (start/end) to check dependent "events".
adding a "link" to a "parent" also creates a link to the "child" (no need to evaluate all pending event's links).
Also:
The "link" property needs to be a List or Array to be able hold several references (e.g. 2:[1,3]).
Analogous to the link property start_me_on_condition a stop_me_on_condition association may be desirable (see Leandro's suggestion of type, it would need to be extended to support multiple links+type)
An independet delay "event" might be more practical than a lag property.
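Extending the sketch from the previous answer to these points: a hedged sketch where links is an array and the start condition is "all linked events have ended" (TimelineEvent, eventStart and eventEnd are again invented names):

// Sketch: an event with links may start only after *all* linked events
// have ended, i.e. its start is the maximum of their end times plus a lag.
interface TimelineEvent {
  duration: number;
  lag: number;
  links: TimelineEvent[];   // empty => use absoluteStart
  absoluteStart?: number;
}

function eventStart(e: TimelineEvent): number {
  if (e.links.length === 0) return (e.absoluteStart ?? 0) + e.lag;
  return Math.max(...e.links.map(eventEnd)) + e.lag;
}

function eventEnd(e: TimelineEvent): number {
  return eventStart(e) + e.duration;
}

// Events 1 and 3 start at t=0; event 2 waits for both to finish.
const ev1: TimelineEvent = { duration: 2, lag: 0, links: [], absoluteStart: 0 };
const ev3: TimelineEvent = { duration: 7, lag: 0, links: [], absoluteStart: 0 };
const ev2: TimelineEvent = { duration: 6, lag: 0, links: [ev1, ev3] };
console.log(eventStart(ev2)); // 7 -- event 3 (ending at 7) is the later one

A separate "delay event" could then be modelled as an ordinary TimelineEvent with a duration and no payload, which keeps the lag property out of the model entirely.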

understanding getByteTimeDomainData and getByteFrequencyData in web audio

The documentation for both of these methods is very generic wherever I look. I would like to know what exactly I'm looking at in the arrays returned by each method.
For getByteTimeDomainData, what time period is covered with each pass? I believe most oscilloscopes cover a 32 millisecond span for each pass. Is that what is covered here as well? As for the actual element values themselves, the range seems to be 0-255. Is this equivalent to -1 to +1 volts?
For getByteFrequencyData, the frequencies covered are based on the sampling rate, so each index is an actual frequency, but what about the actual element values themselves? Is there a dB range that corresponds to the values in the returned array?
getByteTimeDomainData (and the newer getFloatTimeDomainData) return an array of the size you requested - that is, frequencyBinCount, which is calculated as half of the requested fftSize. That array is, of course, at the current sampleRate exposed on the AudioContext, so with the default fftSize of 2048, frequencyBinCount will be 1024, and if your device is running at 44.1kHz, that will equate to around 23ms of data.
The byte values do range between 0-255, and yes, that maps to -1 to +1, so 128 is zero. (It's not volts, but full-range unitless values.)
If you use getFloatFrequencyData, the values returned are in dB; if you use the Byte version, the values are mapped based on minDecibels/maxDecibels (see the minDecibels/maxDecibels description).
Mozilla's documentation describes the difference between getFloatTimeDomainData and getFloatFrequencyData, which I summarize below. The Mozilla docs reference the Web Audio experiment the voice-change-o-matic, which illustrates the conceptual difference to me (it only works in my Firefox browser; it does not work in my Chrome browser).
TimeDomain/getFloatTimeDomainData
TimeDomain functions are over some span of time.
We often visualize TimeDomain data using oscilloscopes.
In other words:
we visualize TimeDomain data with a line chart,
where the x-axis (aka the "original domain") is time
and the y axis is a measure of a signal (aka the "amplitude").
Change the voice-change-o-matic "visualizer setting" to Sinewave to see getFloatTimeDomainData(...)
Frequency/getFloatFrequencyData
Frequency functions (getByteFrequencyData) describe a single point in time, i.e. right now: "the current frequency data"
We sometimes see these in mp3 players / "winamp bargraph style" music players (aka "equalizer" visualizations).
In other words:
we visualize Frequency data with a bar graph
where the x-axis (aka "domain") are frequencies or frequency bands
and the y-axis is the strength of each frequency band
Change the voice-change-o-matic "visualizer setting" to Frequency bars to see getFloatFrequencyData(...)
Fourier Transform (aka Fast Fourier Transform/FFT)
Another way to think about "time domain vs frequency" is shown in the diagram on the Fast Fourier Transform Wikipedia page:
getFloatTimeDomainData gives you the chart on the top (x-axis is Time)
getFloatFrequencyData gives you the chart on the bottom (x-axis is Frequency)
a Fast Fourier Transform (FFT) converts the Time Domain data into Frequency data; in other words, the FFT converts the first chart into the second chart.
cwilso has it backwards.
the time data array is the longer one (fftSize), and the frequency data array is the shorter one (half that, frequencyBinCount).
An fftSize of 2048 at the usual sample rate of 44.1kHz means each sample lasts 1/44100 of a second; you have 2048 samples at hand, and thus are covering a duration of 2048/44100 seconds, which is about 46 milliseconds, not 23 milliseconds. The frequencyBinCount is indeed 1024, but that refers to the frequency domain (as the name suggests), not the time domain, and the computation 1024/44100, in this context, is about as meaningful as adding your birth date to the fftSize.
A little math illustrating what's happening: the Fourier transform is a 'vector space isomorphism', that is, a mapping going bijectively (i.e., reversibly) between two vector spaces of the same dimension: the 'time domain' and the 'frequency domain.' The vector space dimension we have here (in both cases) is fftSize.
So where does the 'half' come from? The frequency-domain coefficients 'count double'. Either because they 'actually are' complex numbers, or because you have the 'sin' and the 'cos' flavor, or because you have a 'magnitude' and a 'phase', which you'll understand if you know how complex numbers work. (Those are three ways to say the same thing in different jargon, so to speak.)
I don't know why the API only gives us half of the relevant numbers when it comes to frequency - I can only guess. And my guess is that those are the 'magnitude' numbers, and the 'phase' numbers are thrown out. The reason that this is my guess is that in applications, magnitude is far more important than phase. Still, I'm quite surprised that the API throws out information, and I'd be glad if some expert who actually knows (and isn't guessing) can confirm that it's indeed the magnitude. Or - even better (I love to learn) - correct me.
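Putting the sizes and units together, here is a small sketch using the standard Web Audio AnalyserNode; nothing here is specific to the answers above, and the sizes follow the spec (the time-domain array holds fftSize samples, the frequency array holds frequencyBinCount bins):

// Sketch: read both views of the signal from one AnalyserNode.
const ctx = new AudioContext();
const analyser = ctx.createAnalyser();
analyser.fftSize = 2048;                  // time-domain window length in samples

// Time domain: fftSize samples, byte values 0..255 where 128 is zero amplitude.
const timeData = new Uint8Array(analyser.fftSize);
analyser.getByteTimeDomainData(timeData);
const windowSeconds = analyser.fftSize / ctx.sampleRate;
// e.g. 2048 / 44100 ≈ 0.046 s of signal per call at a 44.1 kHz sample rate

// Frequency domain: frequencyBinCount (= fftSize / 2) magnitude bins.
// Byte values are the dB magnitudes mapped onto 0..255 between
// analyser.minDecibels and analyser.maxDecibels; getFloatFrequencyData
// returns the values directly in dB.
const freqData = new Uint8Array(analyser.frequencyBinCount);
analyser.getByteFrequencyData(freqData);
const binWidthHz = ctx.sampleRate / analyser.fftSize;
// bin i covers roughly i * binWidthHz, from 0 Hz up to sampleRate / 2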

x264 rate control modes

Recently I have been reading the x264 source code. Mostly, I am concerned with the RC (rate control) part, and I am confused about the parameters --bitrate and --vbv-maxrate. When bitrate is set, CBR mode is used at the frame level. If you want to enable MB-level RC, the parameters bitrate, vbv-maxrate, and vbv-bufsize should all be set. But I don't know the relationship between bitrate and vbv-maxrate. What determines the real encoding result when bitrate and vbv-maxrate are both set?
And what is the recommended value for bitrate? Equals to vbv-maxrate?
Also what is the recommended value for vbv-bufsize? Half of vbv-maxrate?
Please give me some advice.
bitrate addresses the "target file size" when you are encoding. It is understandably confusing because it applies a "budget" of a certain size and then tries to apportion this budget across the frames - that is why the later parts of a movie get a smaller amount of data, which results in lower video quality. For example, if you have 10 seconds of completely black images followed by 10 seconds of natural video, the final encoded file will be very different than if the order were the opposite.
vbv-bufsize is the buffer that has to be filled before a "transmission" would occur, say, in a streaming scenario. Now, let's tie this to I-frames and P-frames: the vbv-bufsize will limit the size of any of your encoded video frames - most likely the I-frame.
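As a rough, hedged illustration of how the three options are typically combined on the command line (the numbers are placeholders, not recommendations; vbv-maxrate is normally set at or above bitrate):

# target an average of ~2000 kbit/s, but never exceed 2500 kbit/s
# as measured against a 5000 kbit VBV buffer
x264 --bitrate 2000 --vbv-maxrate 2500 --vbv-bufsize 5000 -o out.264 input.y4m

In this setup bitrate steers the long-term average (the "budget"), while vbv-maxrate and vbv-bufsize bound how far the encoder may locally deviate from that average, which is what keeps the stream decodable with a bounded client-side buffer.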

Unlimited Map Dimensions for a game in AS3

Recently I've been planning out how I would run a game with an environment/map that is capable of unlimited dimensions (unlimited being a loose term, as there are obviously limitations on how much data can be stored in memory, etc.). I've achieved this using a "grid" that contains level data stored as a String that can be converted to a 2D Array representing objects and their properties.
Here's an example of two objects stored as a String:
"game.doodads.Tree#200#10#terrain$game.mobiles.Player#400#400#mobiles"
The "grid" is a 3D Array, of which the contents would represent the x/y coordinate of the grid cell. The grid cells would be, say, 600x600.
An example of this "grid" Array would be as follows:
var grid:Array = [[["leveldata: 0,0"],["leveldata 0,1"]],
[["leveldata: 1,0"],["leveldata 1,1"]]];
The environment will handle loading a grid square and its 8 surrounding squares based on a given point, i.e. the position of the Player. It would have a function along the lines of
function loadCells(xp:int, yp:int):void
This would also handle the unloading of the previously loaded cells that are no longer close enough to be required. In the unload process, the data at grid[x][y] would be overwritten with the new data, which is created by looping through the objects in that cell and appending each new set of data to the grid cell data.
Everything works fine in terms of when you move in a direction, cells are unloaded/saved and new cells are loaded. The problem is this:
Say this is a large city infested by zombies. If you walk three grid squares in any direction and return, everything is as you left it. I'm struggling to find a way to at least simulate all objects still moving around and doing their thing. It looks silly when you for example throw a grenade, walk away, return and the grenade still hasn't detonated.
I've considered storing a timestamp on each object when I unload the level, and when it's re-initialized there's a loop that runs its "step" function the corresponding number of times. The problem here is obviously that when you come back 5 minutes later, 20 zombies are going to try to step 248932489 times and the game will crash.
Ideas?
I don't know AS3 but let me try give you some tips.
It seems you want to make a seamless world, since you load/unload cells as the player moves. That's a good idea. What you should do here is decide what data a cell should load/unload (or, even further, what data a cell should hold or handle).
For example, the grenade should not be unloaded by the cell, as it needs to be updated even after the player leaves the cell. It's NOT a good idea for a cell to manage all game objects just because they are located in the cell. Instead, the player object could handle the grenade as its owner. Or, there could be one EntityManager which handles all game entities like grenades or zombies, as sketched below.
With this idea, you can update your 20 zombies even when they are in an unloaded zone (their location does not matter anymore) instead of calling update() 248932489 times at once. However, do you really need to keep updating the zombies? Perhaps unloading all of them and later spawning new zombies matching the number that were last active in the cell would work. It depends on your game design, but usually you don't have to update entities which are not visible or are far enough from the player. Hope it helps. Good luck! :D
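A hedged sketch of that separation, in TypeScript rather than AS3 (Entity, Grenade and EntityManager are invented names; the point is the ownership split, not the API): dynamic objects keep ticking in a global manager no matter which cells are loaded, while cells only carry static level data.

// Sketch: grenades (and anything else that must keep "living" while its
// cell is unloaded) are owned by a global manager, not by the cell.
interface Entity {
  x: number;
  y: number;
  update(dt: number): void;
  done: boolean;            // true once the entity can be discarded
}

class Grenade implements Entity {
  done = false;
  constructor(public x: number, public y: number, private fuse: number) {}
  update(dt: number): void {
    this.fuse -= dt;
    if (this.fuse <= 0) {
      // explode() would go here; it fires even if the cell is unloaded
      this.done = true;
    }
  }
}

class EntityManager {
  private entities: Entity[] = [];
  add(e: Entity): void { this.entities.push(e); }
  update(dt: number): void {
    for (const e of this.entities) e.update(dt);
    this.entities = this.entities.filter(e => !e.done);
  }
}

// Game loop: cells only load/unload static level data; the manager
// keeps dynamic objects ticking no matter where the player is.
const manager = new EntityManager();
manager.add(new Grenade(620, 80, 3.0)); // lands in a cell the player may leave
// each frame: manager.update(dtSeconds); then loadCells(playerCellX, playerCellY);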
Interesting problem. Of course, a game cannot simulate an unlimited environment precisely. If you leave some zone for a few minutes, you don't need every step of the zombies (or any actors there) to be simulated precisely. Every game has its own simplifications. Maybe an approximate simulation will help you. For example, if a survivor was in a heavily infested zone for a long time, your simulator could decide that he turned into a zombie, without computing every step of the process. Or, if a horde was rampant in some part of the city, that part could be damaged randomly.
I see two methods as to how you could handle this issue.
first: have the engine keep any active cells loaded, where "active" means there are object-initiated events involving cell-owned objects OR player-owned objects (if you have made such a distinction, that is). Each time such a cell is excluded from the normal unloading behaviour it is assigned a number, and if memory happens to run out, the cells with the lowest numbers will be unloaded. Clearly, this might be the hardest method code-wise, but it might be the only one truly doing what you desire.
second: use very tiny cells and let the engine keep a narrow path loaded. The cells might then be 100x100 instead of 600x600, and 36 of them do the work one bigger cell would do (risk: more clutter code-wise). Then each cell your player actually traverses (not all of those that have been loaded to produce a natural visibility range) is kept in memory, including every cell that has player-owned objects in it, for a limited amount of time (i.e. 5 minutes).
I believe you can work out how to check for these conditions upon unloading yourself, and I hope to have been of help to you.

Reconstructing state from time series data events

For a particular project, we acquire data for a number of events and collect variables about those events at the same time. After the data has been collected, we perform a user-customizable analysis on said data to determine whatever it is that the user is interested in.
The data is collected in a form similar to this:
Timestamp Event
0 x = 0
0 y = 1
3 Event A occurred
3 x = 1
4 Event A occurred
4 x = 2
9 Event B occurred
9 y = 2
9 x = 0
To understand the entire state at any time, the most straightforward approach is to walk over the entire set of data. For example, if I start at time 0, and "analyze" until timestamp 5, I know that at that point x = 2, y = 1, and Event A has occurred twice. That's a really simple example. The user might be (and often is) interested in the time between events, say from A to B, and they might specify the first occurrence of A, then B, or the last occurrence of A, then B (respectively, 9-3 = 6 or 9-4 = 5). Like I said, this is easy to analyze when you're walking over the entire set.
Now, we need to adapt the model to analyze an arbitrary window of time. If we look at 0-N, that's the easy case. But if I look at 1-5, for instance, I have no notion of y unless I begin at 0 and know that y was initially 1 and did not change in the window 1-5.
Our approach is essentially to create a dictionary of variables and run callbacks on events. If one analysis was "What is x when Event A occurs and time is > 3", then we would run that callback on the first Event A, and it would immediately return because time is not greater than 3. It would run again at 4, and it would report that x was 1 at t=4.
To adapt to the "time-windowing", I think I am going to (in the background) tack on additional conditions to the analysis. If their analysis is just "What is x when Event A occurs", and the current window is 1-5, then I will change it to "What is x when Event A occurs and time >= 1 and time <= 5". Then if the next window is 6-10, I can readjust the condition as necessary.
My main question is: what pattern does this fit? We are obviously not the first people to approach a problem like this, but I have not been able to find how others have approached it. I probably just don't know exactly what to search for on Google. Is there any other approach besides keeping a dictionary of the entire global state for looking up a single state at a given time? Note also that the data could have several thousand, maybe tens of thousands, of records, so the fewer iterations over the data set, the better.
I think your best approach here would be to take periodic "snapshots" of the full state data, say every 1000 samples, along with recording the deltas. When you're storing your data as offsets from some original value (aka deltas), you don't have any choice but to reconstruct the full data starting with the original values. Storing periodic snapshots will lessen the amount of reconstruction you have to do - the design tradeoff is between low storage requirements but long reconstruction time on the one hand, and higher storage requirements but shorter reconstruction time on the other.
MPEGs, for example, store each frame as the differences between the current frame and the previous frame. Ordinarily, this would force an MPEG to be viewed from the beginning, but the format also periodically stores full frames so that the decoder doesn't have to backtrack all the way to the beginning of the file.
You can search by time in log(N), and you can have a feeling for how many updates are acceptable... hence here's my solution:
Pick a number, N, of updates that are acceptable in order to return a result. 256 might be good, given the scales you've mentioned so far.
Every N records, commit an entry of all state to a dictionary, with a timestamp.
Now, you have a tradeoff: dictionary size against speed. N → ∞ is regular searching; N = 1 is your current solution; any N in between requires less memory but is slower.
Your implementation is now (for time X):
log(N) search of the subsampled global dictionary to the timestamp before X (call it Y).
log(N) search of the event list to timestamp Y, then perform fewer than N updates.
Picking N as a power of two will even allow you to do some nice shift tricks to do a rounded-down integer divide nice and fast.
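A hedged sketch of the snapshot-plus-replay idea in TypeScript (DataRecord, Snapshot, buildSnapshots and stateAt are invented names, and the records are assumed to have the "variable = value" shape shown in the question's sample data):

// Sketch: every N records, store a full copy of the variable state.
// To get the state at time X: binary-search the snapshot list for the
// last snapshot at or before X, then replay at most ~N records forward.
interface DataRecord { timestamp: number; variable?: string; value?: number; event?: string; }
interface Snapshot { timestamp: number; index: number; state: Map<string, number>; }

const N = 256; // snapshot interval; a power of two allows shift tricks

function buildSnapshots(records: DataRecord[]): Snapshot[] {
  const snapshots: Snapshot[] = [];
  const state = new Map<string, number>();
  records.forEach((r, i) => {
    if (r.variable !== undefined) state.set(r.variable, r.value!);
    if (i % N === 0) {
      snapshots.push({ timestamp: r.timestamp, index: i, state: new Map(state) });
    }
  });
  return snapshots;
}

function stateAt(records: DataRecord[], snapshots: Snapshot[], time: number): Map<string, number> {
  if (records.length === 0 || time < records[0].timestamp) return new Map();
  // binary search: last snapshot whose timestamp is <= time
  let lo = 0, hi = snapshots.length - 1, best = 0;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (snapshots[mid].timestamp <= time) { best = mid; lo = mid + 1; } else { hi = mid - 1; }
  }
  const snap = snapshots[best];
  const state = new Map(snap.state);
  // replay the (at most roughly N) records between the snapshot and the query time
  for (let i = snap.index + 1; i < records.length && records[i].timestamp <= time; i++) {
    const r = records[i];
    if (r.variable !== undefined) state.set(r.variable, r.value!);
  }
  return state;
}

Event counts ("Event A occurred twice by t=5") could be handled the same way by treating each counter as just another variable in the state map.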