Recently I have been reading the x264 source code; mostly I am concerned with the rate control (RC) part. And I am confused about the parameters --bitrate and --vbv-maxrate. When bitrate is set, CBR mode is used at the frame level. If you want to enable MB-level RC, the parameters bitrate, vbv-maxrate and vbv-bufsize should all be set. But I don't know the relationship between bitrate and vbv-maxrate. When both are set, which one determines the actual encoding result?
And what is the recommended value for bitrate? Should it equal vbv-maxrate?
Also, what is the recommended value for vbv-bufsize? Half of vbv-maxrate?
Please give me some advice.
bitrate addresses the "target file size" of the encode. It is understandably confusing because it sets a "budget" of a certain size and then tries to apportion this budget across the frames - that is why the later parts of a movie can get a smaller share of the data, which results in lower video quality. For example, if you have 10 seconds of completely black frames followed by 10 seconds of natural video, the final encoded file will look very different than if the order were the opposite.
vbv-bufsize models the buffer that has to be filled before a "transmission" can occur, say in a streaming scenario. Now, let's tie this to I-frames and P-frames: vbv-bufsize will limit the size of any single encoded video frame - most likely the I-frame.
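The interaction is easiest to see as a leaky bucket: bitrate is the long-term average the encoder aims for, vbv-maxrate is the rate at which the (hypothetical) decoder buffer refills, and vbv-bufsize is that buffer's capacity, which bounds any burst of frames. As a rule of thumb, setting bitrate equal to vbv-maxrate gives CBR-like output, and vbv-bufsize is often set to roughly one second's worth of data (about equal to maxrate), but the right values depend on your delivery target. Here is a minimal sketch of that model; the numbers and the refill/drain logic are illustrative assumptions, not x264's actual code:

```java
// Minimal leaky-bucket sketch of the VBV model (illustrative, not x264 code).
public class VbvSketch {
    public static void main(String[] args) {
        double maxrate = 5000;  // kbit/s: vbv-maxrate, the buffer refill rate
        double bufsize = 5000;  // kbit:   vbv-bufsize, the buffer capacity
        double fps = 30;
        double fill = bufsize;  // simplification: buffer starts full

        // Hypothetical frame sizes in kbit: a large I-frame, then P/B frames.
        double[] frames = { 400, 120, 90, 150 };
        for (double frame : frames) {
            fill = Math.min(fill + maxrate / fps, bufsize); // refill between frames
            if (frame > fill) {
                System.out.println("VBV underflow: this frame must be compressed harder");
            }
            fill = Math.max(0, fill - frame);               // decoder drains the frame
        }
    }
}
```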
I read the following from the official docs on MediaCodec:
Raw audio buffers contain entire frames of PCM audio data, which is one sample for each channel in channel order. Each PCM audio sample is either a 16 bit signed integer or a float, in native byte order.
https://source.android.com/devices/graphics/arch-sh
The way I read this is that a buffer contains an entire frame of audio, but a frame is just one signed integer. This doesn't seem to make sense. Or is this two values, one each for the left and right channels? Why call it a buffer when it only contains a single value? To me, a buffer refers to several values spanning several milliseconds.
Here's what the docs for AudioFormat say:
For linear PCM, an audio frame consists of a set of samples captured at the same time, whose count and channel association are given by the channel mask, and whose sample contents are specified by the encoding. For example, a stereo 16 bit PCM frame consists of two 16 bit linear PCM samples, with a frame size of 4 bytes.
You are right that it doesn't make sense to use a buffer for just one frame. And in practice buffers are filled with many frames.
You can figure out the number of frames in a buffer from the size property of MediaCodec.BufferInfo and the frame size.
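For example, a minimal sketch, assuming 16-bit PCM (the channel count would come from the output MediaFormat's KEY_CHANNEL_COUNT, and info is what dequeueOutputBuffer() filled in):

```java
import android.media.MediaCodec;

// Sketch: count the PCM frames held in one decoded output buffer.
static int framesInBuffer(MediaCodec.BufferInfo info, int channelCount) {
    int bytesPerSample = 2;                        // ENCODING_PCM_16BIT
    int frameSize = channelCount * bytesPerSample; // stereo 16-bit: 4 bytes
    return info.size / frameSize;                  // e.g. 4096 / 4 = 1024 frames
}
```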
I am estimating the encoding and decoding performance of the HDVICP2 on a TI DM385. The SoC is very old, so the FAEs just ignored me.
My question is: how do I correctly calculate the encoding/decoding performance of the HDVICP2 encoder/decoder?
For example:
According to "Table 2 Cycle Information" of datasheet "H264 High Profile Decoder 2.0 on HDVICP2 and Media Controller Based Platform Data Sheet (SPRS839)", I can find the average and peak cycles per second of test file "station_p1920x1080_7mbps_IPB_30fps.264" are 155.57 and 160.63 separately.
If every operation of HDVICP2 takes only one cycle, I can simply say that a 1080p60 video with the same characteristic will take about 2 times cycles per second so the decoder is able to deal with only one 1080p60 video but unable to deal with more 1080p60 video at the same time. (If working frequency is 533MHZ, 155.57*2=311 and smaller than 533)
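Expressed as code, my back-of-the-envelope estimate looks like this (the figures are from Table 2 of SPRS839; the linear scaling with frame rate is my assumption):

```java
// Back-of-the-envelope HDVICP2 decode budget (assumes load scales with fps).
public class HdvicpEstimate {
    public static void main(String[] args) {
        double avg1080p30 = 155.57;          // Mcycles/s for the 1080p30 test clip
        double clock = 533.0;                // MHz, HDVICP2 working frequency
        double est1080p60 = avg1080p30 * 2;  // 311.14 Mcycles/s estimated
        System.out.printf("1080p60 estimate: %.2f Mcycles/s%n", est1080p60);
        System.out.printf("Streams that fit: %d%n", (int) (clock / est1080p60));
    }
}
```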
Is the above hypothesis correct? If not, please tell me why. Thanks very much.
I'd like to do some stuff with H.264 data recorded from an Android phone.
My colleague told me there should be 4 bytes right after mdat which specify the NALU size, then one byte of NALU metadata, and then the raw data; then (after that NALU) another 4 bytes with the next NALU size, and so on.
But I have a lot of zeros right after mdat:
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0e00000000000000000000000000000000000000000000000000000000000000
8100000000000000000000000000000000000000000000000000000000000000
65b84f0f87fa890001022e7fcc9feef3e7fabb8e0007a34000f2bbefffd07c3c
bfffff08fbfefff04355f8c47bdfd05fd57b1c67c4003e89fe1fe839705a699d
c6532fb7ecacbfffe82d3fefc718d15ffffbc141499731e666f1e4c5cce8732f
bf7eb0a8bd49cd02637007d07d938fd767cae34249773bf4418e893969b8eb2c
Before the mdat atom there are just ftyp mp42, isom mp42 and free atoms. All other atoms (moov, ...) are at the end of the file (that's what Android does when it writes to a socket instead of a file). If necessary, I've got the PPS and SPS from another file recorded just a second earlier with the same camera and encoder settings, just to have that PPS and SPS data.
So how exactly can I get the NALUs from that?
You can't. The moov atom contains the information required to parse the mdat; without it, the mdat has little value.
For instance, the first NALU does not need to start at the beginning of the mdat - it can start anywhere within it. The byte offset it starts at is recorded in (I believe) the stco box. If the file has audio, you will find audio and video interleaved within the mdat with no way to determine which is which without the chunk offsets. In addition, if the video has B-frames, there is no way to determine render order without the cts, again only available in the moov. And technically, the NALU size does not need to be 4 bytes, and you can't know that without the moov.
I recommend not using MP4 for this. Use a streamable container such as TS or FLV. Now, if you can make some assumptions about the code that is producing the file - for example that the chunk offset is always the same and there are no B-frames - you can hard-code those values, but that is not guaranteed to keep working after a software update.
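If you do decide to hard-code those assumptions, a minimal sketch of a walker might look like this - assuming 4-byte big-endian length prefixes (AVCC style) and a known start offset inside the mdat, which are exactly the things only the moov can confirm:

```java
import java.nio.ByteBuffer;

// Sketch: walk length-prefixed NAL units in a byte range of the mdat.
static void walkNalUnits(ByteBuffer mdat, int start, int end) {
    mdat.position(start);
    while (mdat.position() + 4 <= end) {
        int naluSize = mdat.getInt();                    // 4-byte big-endian length
        if (naluSize <= 0 || mdat.position() + naluSize > end) {
            break;                                       // an assumption was wrong
        }
        int nalType = mdat.get(mdat.position()) & 0x1F;  // low 5 bits of header byte
        System.out.println("NALU type " + nalType + ", size " + naluSize);
        mdat.position(mdat.position() + naluSize);       // skip the payload
    }
}
```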
Does an H.264 buffer contain presentation time stamp and decoding time stamp information?
When we get H.264 NALU data, does it contain timing information?
If you mean raw H.264 NAL units, then no, they don't contain timing information in the sense of PTS/DTS. Timestamps live at a higher level, in containers like MKV/MP4/TS. The only time-related information in the H.264 spec, AFAIK, is num_units_in_tick/time_scale in the VUI, which can be used to find the FPS in the case of a constant frame rate (fixed_frame_rate_flag = 1), and some fields in the picture timing SEI; but as those are optional and not really well specified, nobody really uses them.
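For what it's worth, here is a sketch of the FPS derivation from those VUI fields; in H.264, time_scale counts time units per second and a frame spans two units (field-based accounting), hence the factor of 2. The example values are made up for illustration:

```java
// fps = time_scale / (2 * num_units_in_tick), for frame-coded content.
int numUnitsInTick = 1000;
int timeScale = 60000;
double fps = timeScale / (2.0 * numUnitsInTick);  // 60000 / 2000 = 30 fps
```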
We transmit IBP frames as IPBBPBB and then display them as IBBPBBP. The question is: why do we do that? I can't quite visualize it in my head. I mean, why not just transmit them in the order they are to be displayed?
With bi-directional frames in temporal compression, decoding order (order in which the data needs to be transmitted for sequential decoding) is different from presentation order. This explains the effect you are referring to.
For example, you need the data for frame P2 to decode frame B1, so when it comes to transmission, P2 goes ahead.
See more on this: B-Frames in DirectShow
When MPEG-2 appeared, a new frame type was introduced: the bi-directionally predicted frame, or B frame. As the name suggests, the frame is derived from at least two other frames - one from the past and one from the future.
Since the B1 frame is derived from the I0 and P2 frames, both of them must be available to the decoder before decoding of B1 can start. This means the transmission/decoding order is not the same as the presentation order. That's why there are two types of time stamps for video frames - PTS (presentation time stamp) and DTS (decoding time stamp).
Briefly: it's done to speed up the decoder. You've mentioned the usual MPEG-2 GOP (Group of Pictures), so I'll explain the answer for MPEG-2; H.264 uses exactly the same logic, though.
For the coder, the picture difference is smaller when it is calculated using not only previous frames but successive frames too. That's why the coder (usually) processes frames in display order: in an IBBPBBP GOP, every B frame can use the previous I frame and the next P frame to make its prediction.
For the decoder, it's better when every frame uses only previously received frames for prediction. That's why the pictures in the bitstream are reordered: in the reordered group IPBBPBB, every B frame uses an I frame and a P frame that both precede it, and that's faster.
Also, every frame has its own PTS (presentation timestamp), which implicitly determines display order - so the reordering is no big deal. The mapping is sketched below.
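Here is that GOP written out both ways (the display indices are mine, for illustration); PTS follows the first ordering, DTS the second:

```java
// The same GOP in display order vs. the order it is transmitted/decoded.
String[] displayOrder = { "I0", "B1", "B2", "P3", "B4", "B5", "P6" };
String[] decodeOrder  = { "I0", "P3", "B1", "B2", "P6", "B4", "B5" };
// B1/B2 need P3 (their future reference) decoded first, so P3 moves
// ahead of them in the bitstream; likewise P6 moves ahead of B4/B5.
```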
This Wikipedia article can give you the answer to your question.