IDR and non-IDR difference - H.264

What is the difference between IDR and non-IDR frames?
What are
"coded slice of non IDR picture" ,
"Coded slice Data partition A",
"Coded slice Data partition B",
"Coded slice Data partition C",
"Coded slice of an IDR picture" ?

IDR, slices, partitioning - you have them all defined formally right there in the specification:
3 Definitions
3.62 instantaneous decoding refresh (IDR) picture: A coded picture in which all slices are I or SI slices that
causes the decoding process to mark all reference pictures as "unused for reference" immediately after
decoding the IDR picture. After the decoding of an IDR picture all following coded pictures in decoding
order can be decoded without inter prediction from any picture decoded prior to the IDR picture. The first
picture of each coded video sequence is an IDR picture.
3.27 coded picture: A coded representation of a picture. A coded picture may be either a coded field or a coded
frame. Coded picture is a collective term referring to a primary coded picture or a redundant coded picture,
but not to both together.
3.136 slice: An integer number of macroblocks or macroblock pairs ordered consecutively in the raster scan within
a particular slice group. For the primary coded picture, the division of each slice group into slices is a
partitioning. Although a slice contains macroblocks or macroblock pairs that are consecutive in the raster
scan within a slice group, these macroblocks or macroblock pairs are not necessarily consecutive in the raster
scan within the picture. The addresses of the macroblocks are derived from the address of the first
macroblock in a slice (as represented in the slice header) and the macroblock to slice group map.
3.137 slice data partitioning: A method of partitioning selected syntax elements into syntax structures based on a category associated with each syntax element.
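These names map directly onto NAL unit types in the bitstream, so in practice you can tell them apart by reading the nal_unit_type field of each NAL unit header (type 1 is a non-IDR slice, 2-4 are the data partitions, 5 is an IDR slice). Below is a minimal sketch, not a full parser, that lists the NAL units of an Annex B byte stream; the helper name is mine and everything beyond the one header byte is skipped.

```typescript
// Minimal sketch: classify H.264 NAL units by nal_unit_type (Table 7-1 of the spec).
// Assumes an Annex B stream (start codes 00 00 01 / 00 00 00 01); not a full parser.
const NAL_NAMES: Record<number, string> = {
  1: "Coded slice of a non-IDR picture",
  2: "Coded slice data partition A",
  3: "Coded slice data partition B",
  4: "Coded slice data partition C",
  5: "Coded slice of an IDR picture",
  7: "Sequence parameter set",
  8: "Picture parameter set",
};

function listNalUnits(stream: Uint8Array): string[] {
  const result: string[] = [];
  for (let i = 0; i + 3 < stream.length; i++) {
    // Look for a 3-byte start code 00 00 01 (a 4-byte 00 00 00 01 ends with the same pattern).
    if (stream[i] === 0 && stream[i + 1] === 0 && stream[i + 2] === 1) {
      const header = stream[i + 3];
      const nalRefIdc = (header >> 5) & 0x03; // 2 bits: importance as a reference
      const nalUnitType = header & 0x1f;      // 5 bits: what the payload is
      result.push(
        `type ${nalUnitType} (${NAL_NAMES[nalUnitType] ?? "other"}), nal_ref_idc ${nalRefIdc}`
      );
    }
  }
  return result;
}
```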

Encoder number of outputs for opcode within a MIPS machine instruction

If I have an encoder with 8 data inputs, what is its maximum number of outputs?
I know that an encoder is a combinational circuit that performs the reverse operation of a decoder. It has a maximum of 2^n input lines and ‘n’ output lines, hence it encodes the information from 2^n inputs into an n-bit code. Since I have 8 data inputs, the output will be 3 lines, since 2^3 = 8. Is that the correct assumption?
Let's try to tease apart the concepts of one hot (decoded) lines and an encoding using a number of bits.  Both of these are ways to represent information, but their form and typical usage differ.
One hot is a technique wherein at most one line is 1/true and all the other lines are 0/false.  These one hot lines are not considered digits in a number, but rather individual signals or conditions (only one of which can be true at any given time).  This form is particularly useful in certain circuits, as each of the one hot lines can activate some other hardware.  (A hardware lookup table (LUT), a RAM or ROM may use one-hot within its internal array indexing.)
Encoding is a technique where we use N lines as digits in an N-bit number, as would be found in a CPU register holding a number, or as we might write normal binary numbers in text.  By contrast, in this form any of the N bits can be 1 (or 0).
Simple encoders & decoders translate between encoded form (N-bit numbers) and one hot form (2^N lines).
... encoder ... has a maximum of 2^n input lines and ‘n’ output lines
In your statement, the 2^n input lines are in one hot form, while the output lines are normal numbers in binary (i.e. encoded).
Both the inputs (2^n lines) and the outputs (n lines) are capable of representing exactly 2^n different values!  As a result, decode/encode is a 1:1 mapping, back & forth.  (It would be an error to have multiple hots on the one-hot input side of such an encoder, and bad things would happen in a system that allowed that.)
In the formulas you're speaking to:  2^N = V,  and  N = log2(V)  —  N stands for the number of bits (a bit is a binary digit), and V stands for the number of values that can be represented in N bits.
(The 2's in these formulas are for binary — substitute 10 for 2 to get the same relationship between the number of decimal digits and the number of values that many digits can represent/store/communicate.)
In one hot form we need V lines, whereas in encoded form we need N lines (as bits/digits) to represent the same information (one of V different values).
Consider whether a number you're looking at is a digit count (as with N) or a value count (as with V).
And bear in mind that in one hot form, we need one line for each possible value, V (whereas in encoded form we need N bits for V possible values).
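If it helps to see the two forms side by side, here is a purely illustrative sketch (the function names are mine) of an 8-to-3 encoder and the matching 3-to-8 decoder:

```typescript
// Illustrative 8-to-3 encoder: 2^n = 8 one-hot input lines -> n = 3 encoded output bits.
function encode8to3(oneHot: boolean[]): number {
  const hotLines = oneHot.map((_, i) => i).filter(i => oneHot[i]);
  if (hotLines.length !== 1) throw new Error("exactly one input line must be hot");
  return hotLines[0]; // the 3-bit value 0..7
}

// Matching 3-to-8 decoder: n = 3 encoded bits -> 2^n = 8 one-hot output lines.
function decode3to8(value: number): boolean[] {
  return Array.from({ length: 8 }, (_, i) => i === value);
}

// Round trip: the mapping is 1:1 in both directions.
console.log(encode8to3([false, false, false, false, false, true, false, false])); // 5
console.log(decode3to8(5)); // only index 5 is true
```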
A MIPS processor will feed the 6-bit opcode field into a lookup table of some sort, in order to determine which set of control signals to activate for any given instruction.  (The opcode field is not one hot, but rather a bit field of N=6 bits.)
These control signals are (also) not one hot, and the MIPS instruction decoder is not using a simple decoder, but rather a mapper that goes between encoded opcode values and effectively encoded control signals — this mapping is accomplished by lookup in a table.
These control signals are individual boolean values rather than a set forming either a one-hot code or an encoded number.  One hot may be used internally in the indexing of this mapping.  The mapping is basically an array lookup where the index is the opcode and each array element holds all the individual control signal values appropriate to its index.
(R-Type instructions all share a common opcode value, so when the R-Type opcode value is present, an additional lookup/mapping is done on the funct bit field to generate the proper control signals.)
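As a rough sketch of that lookup idea (the opcode values 0x00, 0x23, 0x2b and 0x04 are real MIPS opcodes for R-type, lw, sw and beq; the control-signal names and values follow the usual single-cycle textbook design and are meant as an illustration, not a definitive table):

```typescript
// Illustrative control-signal record (the usual single-cycle textbook signals).
interface ControlSignals {
  regDst: boolean; aluSrc: boolean; memToReg: boolean; regWrite: boolean;
  memRead: boolean; memWrite: boolean; branch: boolean; aluOp: number;
}

// The 6-bit opcode field indexes into a table of control-signal values.
const controlTable: Record<number, ControlSignals> = {
  0x00: { regDst: true,  aluSrc: false, memToReg: false, regWrite: true,
          memRead: false, memWrite: false, branch: false, aluOp: 2 }, // R-type
  0x23: { regDst: false, aluSrc: true,  memToReg: true,  regWrite: true,
          memRead: true,  memWrite: false, branch: false, aluOp: 0 }, // lw
  0x2b: { regDst: false, aluSrc: true,  memToReg: false, regWrite: false,
          memRead: false, memWrite: true,  branch: false, aluOp: 0 }, // sw
  0x04: { regDst: false, aluSrc: false, memToReg: false, regWrite: false,
          memRead: false, memWrite: false, branch: true,  aluOp: 1 }, // beq
};

function decodeInstruction(instruction: number): ControlSignals {
  const opcode = (instruction >>> 26) & 0x3f; // top 6 bits of the 32-bit instruction word
  const signals = controlTable[opcode];
  if (!signals) throw new Error(`unknown opcode 0x${opcode.toString(16)}`);
  // For opcode 0x00 (R-type) a real decoder also inspects the 6-bit funct field
  // (instruction & 0x3f) to pick the ALU operation.
  return signals;
}
```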

represent bitmask value in JSON object

What is the best way to represent bit-mask values in a JSON object?
For example:
we want to know which ingredients the user wants in their fruit salad
Orange = 0x01
Apple = 0x02
Banana = 0x04
Grapes = 0x08
How would one represent the selected options in a JSON object? Obviously we can use the integer value (i.e. 3 for Orange and Apple), but it is not very readable.
Is there a better way?
Researching a bit on this topic uncovered the following case study:
https://www.smartsheet.com/blog/smartsheet-api-formatting
It's not exactly the same problem, but it was good for extrapolating some solutions here:
Send a list of integers, from a predefined lookup table: e.g. [1, 3] (compromise between space and parsing)
Send the actual bit mask value (harder to parse, takes the least space)
Send a list of strings: e.g. [Orange, Banana] (easy to read, takes most space)
If space is not a constraint, I think the last option is the best.
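For what it's worth, here is a small sketch of converting between the bit-mask form and the list-of-strings form, using the flag values from the question (the helper names are just illustrative):

```typescript
// Flag values from the question.
const FLAGS: Record<string, number> = { Orange: 0x01, Apple: 0x02, Banana: 0x04, Grapes: 0x08 };

// Bit mask (smallest payload, hardest to read) -> list of strings (easy to read, largest payload).
function maskToNames(mask: number): string[] {
  return Object.keys(FLAGS).filter(name => (mask & FLAGS[name]) !== 0);
}

// List of strings -> bit mask.
function namesToMask(names: string[]): number {
  return names.reduce((mask, name) => mask | (FLAGS[name] ?? 0), 0);
}

console.log(JSON.stringify({ ingredients: maskToNames(3) }));                   // {"ingredients":["Orange","Apple"]}
console.log(JSON.stringify({ ingredients: namesToMask(["Orange", "Apple"]) })); // {"ingredients":3}
```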

understanding getByteTimeDomainData and getByteFrequencyData in web audio

The documentation for both of these methods is very generic wherever I look. I would like to know what exactly I'm looking at with the returned arrays I'm getting from each method.
For getByteTimeDomainData, what time period is covered with each pass? I believe most oscopes cover a 32 millisecond span for each pass. Is that what is covered here as well? For the actual element values themselves, the range seems to be 0 - 255. Is this equivalent to -1 to +1 volts?
For getByteFrequencyData the frequencies covered are based on the sampling rate, so each index is an actual frequency, but what about the actual element values themselves? Is there a dB range that corresponds to the values in the returned array?
getByteTimeDomainData (and the newer getFloatTimeDomainData) return an array of the size you requested - it's frequencyBinCount, which is calculated as half of the requested fftSize. That array is, of course, at the current sampleRate exposed on the AudioContext, so if it's the default 2048 fftSize, frequencyBinCount will be 1024, and if your device is running at 44.1kHz, that will equate to around 23ms of data.
The byte values do range between 0-255, and yes, that maps to -1 to +1, so 128 is zero. (It's not volts, but full-range unitless values.)
If you use getFloatFrequencyData, the values returned are in dB; if you use the Byte version, the values are mapped based on minDecibels/maxDecibels (see the minDecibels/maxDecibels description).
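If it helps, here's a tiny sketch of turning those byte values back into more familiar units; the time-domain mapping is the 128-centred one just described, and the frequency mapping assumes the linear minDecibels/maxDecibels scaling (the AnalyserNode defaults are -100 dB and -30 dB):

```typescript
// Time domain: byte 0..255 maps to roughly -1..+1, with 128 as zero.
function byteToSample(b: number): number {
  return (b - 128) / 128;
}

// Frequency domain: byte 0..255 maps linearly onto [minDecibels, maxDecibels]
// (by default -100 dB .. -30 dB on an AnalyserNode).
function byteToDecibels(b: number, minDecibels = -100, maxDecibels = -30): number {
  return minDecibels + (b / 255) * (maxDecibels - minDecibels);
}

console.log(byteToSample(128));    // 0 (silence)
console.log(byteToDecibels(255));  // -30 dB (the maxDecibels ceiling)
```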
Mozilla's documentation describes the difference between getFloatTimeDomainData and getFloatFrequencyData, which I summarize below. The Mozilla docs reference a Web Audio experiment, the voice-change-o-matic, which illustrates the conceptual difference to me (it only works in my Firefox browser; it does not work in my Chrome browser).
TimeDomain/getFloatTimeDomainData
TimeDomain functions are over some span of time.
We often visualize TimeDomain data using oscilloscopes.
In other words:
we visualize TimeDomain data with a line chart,
where the x-axis (aka the "original domain") is time
and the y axis is a measure of a signal (aka the "amplitude").
Change the voice-change-o-matic "visualizer setting" to Sinewave to
see getFloatTimeDomainData(...)
Frequency/getFloatFrequencyData
Frequency functions (getByteFrequencyData) are at a point in time; i.e. right now; "the current frequency data"
We sometimes see these in mp3 players/ "winamp bargraph style" music players (aka "equalizer" visualizations).
In other words:
we visualize Frequency data with a bar graph
where the x-axis (aka the "domain") shows frequencies or frequency bands
and the y-axis is the strength of each frequency band
Change the voice-change-o-matic "visualizer setting" to Frequency bars to see getFloatFrequencyData(...)
Fourier Transform (aka Fast Fourier Transform/FFT)
Another way to think about "time domain vs frequency" is the diagram from the Fast Fourier Transform Wikipedia article:
getFloatTimeDomainData gives you the chart on the top (the x-axis is Time)
getFloatFrequencyData gives you the chart on the bottom (the x-axis is Frequency)
A Fast Fourier Transform (FFT) converts the time-domain data into frequency data; in other words, the FFT converts the first chart into the second chart.
cwilso has it backwards.
the time data array is the longer one (fftSize), and the frequency data array is the shorter one (half that, frequencyBinCount).
An fftSize of 2048 at the usual sample rate of 44.1kHz means each sample lasts 1/44100 of a second; you have 2048 samples at hand, and thus are covering a duration of 2048/44100 seconds, which is about 46 milliseconds, not 23 milliseconds. The frequencyBinCount is indeed 1024, but that refers to the frequency domain (as the name suggests), not the time domain, and the computation 1024/44100 is, in this context, about as meaningful as adding your birth date to the fftSize.
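Putting the sizes side by side in a minimal sketch (assuming the default fftSize of 2048 and a 44.1 kHz AudioContext):

```typescript
const audioCtx = new AudioContext();            // sampleRate is typically 44100 or 48000
const analyser = audioCtx.createAnalyser();
analyser.fftSize = 2048;

// Time domain: one byte per sample, fftSize samples long.
const timeData = new Uint8Array(analyser.fftSize);            // 2048 values
analyser.getByteTimeDomainData(timeData);
console.log(analyser.fftSize / audioCtx.sampleRate);          // 2048 / 44100 ≈ 0.046 s of audio

// Frequency domain: one byte per bin, frequencyBinCount = fftSize / 2 bins,
// spanning 0 Hz up to sampleRate / 2 (the Nyquist frequency).
const freqData = new Uint8Array(analyser.frequencyBinCount);  // 1024 values
analyser.getByteFrequencyData(freqData);
console.log((audioCtx.sampleRate / 2) / analyser.frequencyBinCount); // ≈ 21.5 Hz per bin
```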
A little math illustrating what's happening: the Fourier transform is a 'vector space isomorphism', that is, a mapping going bijectively (i.e., reversibly) between two vector spaces of the same dimension: the 'time domain' and the 'frequency domain'. The vector space dimension we have here (in both cases) is fftSize.
So where does the 'half' come from? The frequency domain coefficients 'count double'. Either because they 'actually are' complex numbers, or because you have the 'sin' and the 'cos' flavor. Or, because you have a 'magnitude' and a 'phase', which you'll understand if you know how complex numbers work. (Those are 3 ways to say the same in a different jargon, so to speak.)
I don't know why the API only gives us half of the relevant numbers when it comes to frequency - I can only guess. And my guess is that those are the 'magnitude' numbers, and the 'phase' numbers are thrown out. The reason that this is my guess is that in applications, magnitude is far more important than phase. Still, I'm quite surprised that the API throws out information, and I'd be glad if some expert who actually knows (and isn't guessing) can confirm that it's indeed the magnitude. Or - even better (I love to learn) - correct me.
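For the record, the standard piece of math behind the 'half' is conjugate symmetry: for a purely real input signal of length N, the DFT coefficients satisfy X[N - k] = X[k]* (the complex conjugate), so bins k and N - k carry the same magnitude and only about N/2 of them are independent. And the guess above matches the spec: the frequency data exposed by the AnalyserNode is the magnitude (smoothed over time and converted to dB); the phase is not exposed.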

Why do we transmit the IBP frames out of order?

We transmit IBP frames in an order like IPBBPBB and then display them as IBBPBBP. The question is: why do we do that? I can't quite visualize it in my head. I mean, why not just transmit them in the order they are to be displayed?
With bi-directional frames in temporal compression, decoding order (order in which the data needs to be transmitted for sequential decoding) is different from presentation order. This explains the effect you are referring to.
In the picture below, you need the data for frame P2 in order to decode frame B1, so when it comes to transmission, P2 goes ahead.
See more on this: B-Frames in DirectShow
[Figure: decoding order vs. presentation order of I, P and B frames (source: monogram.sk)]
With the appearance of MPEG-2 a new frame type was introduced - the bi-directionally predicted frame, or B frame. As the name suggests, the frame is derived from at least two other frames - one from the past and one from the future (Figure 2).
Since the B1 frame is derived from I0 and P2 frames both of them must be available to the decoder prior to the start of the decoding of B1. This means the transmission/decoding order is not the same as the presentation order. That’s why there are two types of time stamps for video frames - PTS (presentation time stamp) and DTS (decoding time stamp).
Briefly: it's done to speed up the decoder. You've mentioned the usual MPEG-2 GOP (Group of Pictures), so I'll explain the answer for MPEG-2; H.264 uses exactly the same logic, though.
For the coder, the picture difference is smaller when it is calculated using not only previous frames but following frames too. That's why the coder (usually) processes frames in display order. So, in an IBBPBBP GOP every B frame can use the previous I frame & the next P frame to make its prediction.
For the decoder, it's better when every frame uses only previously received frames for prediction. That's why the pictures in the bitstream are reordered. In the reordered group IPBBPBB, every B frame uses an I frame & a P frame which have both already arrived, and that's faster.
Also, every frame has its own PTS (presentation timestamp), which implicitly determines the display order - so the reordering is no big deal.
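A tiny sketch of that reordering, with made-up timestamps for the GOP from the question (a real decoder holds decoded frames until their PTS comes up):

```typescript
interface Frame { type: "I" | "P" | "B"; pts: number; dts: number; }

// Transmission/decoding order for the GOP from the question: I P B B P B B
const decodingOrder: Frame[] = [
  { type: "I", pts: 0, dts: 0 },
  { type: "P", pts: 3, dts: 1 }, // sent early: the two B frames below need it
  { type: "B", pts: 1, dts: 2 },
  { type: "B", pts: 2, dts: 3 },
  { type: "P", pts: 6, dts: 4 },
  { type: "B", pts: 4, dts: 5 },
  { type: "B", pts: 5, dts: 6 },
];

// Display order is simply the frames sorted by PTS: I B B P B B P
const displayOrder = [...decodingOrder].sort((a, b) => a.pts - b.pts);
console.log(displayOrder.map(f => f.type).join(" ")); // "I B B P B B P"
```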
This Wikipedia article can give you the answer to your question.

GIF format - separate variable-length codes

I am trying to parse the GIF format and have a problem with reading the image data.
This data is represented as a bit array containing variable-length codes.
ex:
0010-1010-0010-0000-00111-10000-11111...
Sometimes the length of the code increases, but I can't understand how to detect this increase.
I only have the initial code size (the length of the first code, e.g. 4).
The standard says only:
Appendix F. Variable-Length-Code LZW Compression.
...
The Variable-Length-Code aspect of the algorithm is based on an initial code size
(LZW-initial code size), which specifies the initial number of bits used for
the compression codes. When the number of patterns detected by the compressor
in the input stream exceeds the number of patterns encodable with the current
number of bits, the number of bits per LZW code is increased by one.
...
When parsing a GIF file, the Image Descriptor includes the bit width of the unencoded symbols (example: 8 bits).
As you probably already know, the initial code size of the compressed data is one bit wider than the bit width of the unencoded symbols (example: 9 bits).
Also, as you probably already know, the possible compressed code values in a GIF file gradually increase in size,
up to a maximum of 0xFFF == 4095 which requires 12 bits to store.
For every code that the decompressor pulls from the compressed data,
the decompressor adds a new item to its dictionary.
For example, if the first two 9-bit codes the decompressor reads are 0x028 0x0FF,
the decompressor adds a two-byte sequence to its dictionary.
Later if the decompressor ever reads the code 0x102,
the decompressor decodes that 0x102 code to the two 8-bit bytes 0x28 0xFF.
The next item the decompressor adds to the dictionary is assigned the code 0x103.
The next item the decompressor adds to the dictionary is assigned the code 0x104. ...
Eventually the decompressor adds an item to the dictionary that is assigned the code 0x1FF.
That is the largest number that fits into 9 bits.
After storing that item into the dictionary,
the decompressor starts reading 10-bit codes.
The next item the decompressor adds to the dictionary is assigned the code 0x200.
There isn't any special "command" in the data sequence that tells the decompressor to increment the code width.
The decompressor must keep track of how many items the dictionary contains so far (which often can be conveniently re-used as the index of where to write the next item into the dictionary).
When the decompressor adds item 0x1ff to the dictionary, it's time for the decompressor to start reading 10-bit codes.
When the decompressor adds item 0x3ff to the dictionary, it's time for the decompressor to start reading 11-bit codes.
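If it helps to see that bookkeeping in code, here is a rough sketch of a variable-width code reader (GIF packs LZW codes least-significant-bit first; sub-block framing and the clear/end-of-information codes are left out so that the width-growing logic stays visible):

```typescript
// Rough sketch of reading variable-width LZW codes, LSB-first, from GIF image data.
// Sub-block framing and the clear / end-of-information codes are deliberately omitted.
class LzwCodeReader {
  private bitBuffer = 0;
  private bitsInBuffer = 0;
  private pos = 0;

  constructor(private data: Uint8Array, private codeWidth: number) {}

  readCode(): number {
    // Pull whole bytes into the buffer until we hold at least codeWidth bits.
    while (this.bitsInBuffer < this.codeWidth) {
      this.bitBuffer |= this.data[this.pos++] << this.bitsInBuffer;
      this.bitsInBuffer += 8;
    }
    const code = this.bitBuffer & ((1 << this.codeWidth) - 1);
    this.bitBuffer >>>= this.codeWidth;
    this.bitsInBuffer -= this.codeWidth;
    return code;
  }

  // Call this after adding an entry to the dictionary: once the dictionary fills
  // the current code space (e.g. entry 0x1FF has been added for 9-bit codes),
  // widen the codes, up to the 12-bit maximum.
  maybeGrow(dictionarySize: number): void {
    if (dictionarySize === (1 << this.codeWidth) && this.codeWidth < 12) {
      this.codeWidth++;
    }
  }
}
```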
Wikipedia: Graphics Interchange Format
Wikipedia: Lempel–Ziv–Welch with variable-length codes
Look at this example first; it may make LZW clearer to understand than the standard does.
And this may also be useful.