Maximum bitstream size for H.264

Although I am quite familiar with H.264 encoding, I have come to a point where I need advice from more experienced people. I'm performing hardware-accelerated H.264 encoding using Intel Quick Sync and NVIDIA NVENC in a unified pipeline. The issue that troubles me is the bitstream output buffer size. Intel Quick Sync provides a way to query the maximum bitstream size from the encoder, while NVIDIA NVENC does not have such a feature (or at least I haven't found it; pointers are welcome). In their tutorial they state that:
NVIDIA recommends setting the VBV buffer size equal to single frame size. This is very
helpful in low latency applications, where network bandwidth is a concern. Single frame
VBV allows users to enable capped frame size encoding. In single frame VBV, VBV
buffer size must be set to maximum frame size which is equal to channel bitrate divided
by frame rate. With this setting, every frame can be sent to client immediately upon
encoding and the decoder can also decode without any buffering.
For example, if you have a channel bitrate of B bits/sec and you are encoding at N fps,
the following settings are recommended to enable single frame VBV buffer size.
uint32_t maxFrameSize = B/N;
NV_ENC_RC_PARAMS::vbvBufferSize = maxFrameSize;
NV_ENC_RC_PARAMS::vbvInitialDelay = maxFrameSize;
NV_ENC_RC_PARAMS::maxBitRate = NV_ENC_RC_PARAMS::vbvBufferSize * N; // where N is the encoding frame rate.
NV_ENC_RC_PARAMS::averageBitRate = NV_ENC_RC_PARAMS::vbvBufferSize * N;
NV_ENC_RC_PARAMS::rateControlMode= NV_ENC_PARAMS_RC_TWOPASS_CBR;
I am allocating a bitstream buffer pool for quite a few encoding sessions, so sizing each buffer from the network bandwidth (which in my case is not the bottleneck) would leave a lot of unused memory per buffer and result in inefficient memory usage.
So the general question is: is there any way to determine the maximum bitstream size for H.264, assuming there is no frame buffering and each frame generates its own NAL units? Can I assume it will never be larger than the input NV12 buffer? That seems unreliable, since the first frame may carry many NAL units such as SPS/PPS/AUD/SEI in addition to the IDR slice, and I am not sure that their combined size cannot exceed the NV12 buffer size. Does the standard give any pointers on this, or is it totally encoder dependent?
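For illustration, a conservative sizing heuristic (an assumption, not something guaranteed by the H.264 standard or by either SDK) is to size each pool buffer from the uncompressed NV12 frame plus a fixed margin for the parameter-set/SEI NAL units:

#include <cstdint>

// Heuristic worst-case size for one encoded frame's bitstream buffer.
// Assumption: a single H.264 access unit (SPS/PPS/SEI/AUD + slices) will not
// exceed the raw NV12 frame size by more than a small header margin.
uint32_t worstCaseBitstreamBytes(uint32_t width, uint32_t height)
{
    const uint32_t nv12Size = width * height * 3 / 2; // 12 bits per pixel
    const uint32_t headerMargin = 64 * 1024;          // SPS/PPS/SEI/AUD headroom
    return nv12Size + headerMargin;
}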

Related

Comparing H.264 encoding/decoding performance

I am a beginner with video codecs, not a video codec expert.
I just want to know, based on the same criteria, which is more efficient: H.264 encoding or decoding?
Thanks
Decoding is more efficient. To be useful, decoding must run in real time, whereas encoding does not (except in videophone / conferencing applications).
How much more efficient? An encoder has to generate motion vectors. The more compute power used on generating those motion vectors, the more accurate they are. And the more accurate they are, the more bandwidth is available for the difference frames, so the quality goes up.
So, the kind of encoding used to generate video for streaming or for distribution on DVD or BD discs can run many times slower than real time on server farms. But decoding for that kind of program is useless unless it runs in real time.
Even in the case of real-time encoding, it takes more power (actual milliwatts, compute cycles, etc.) than decoding does.
This is true of H.264, H.265, VP8, VP9, and other video codecs.

Max bitrate value for Google Chrome browser

I have a simple question.
What is the current maximum bitrate supported by the Google Chrome browser for a web camera?
For example, if I have a virtual source with a high-bitrate output (constant bitrate of 50 Mbit/s),
would I be able to get all 50 Mbit/s in my Chrome browser when using this device?
Thank you.
The camera's bitrate is irrelevant in this case, since WebRTC is going to re-encode the video with a codec that compresses it anyway.
What matters for WebRTC are four separate parameters:
The resolution supplied and the one the other end of the session is capable of receiving
The frame rate supplied and the one the other end of the session is capable of receiving
The network conditions - there's a limit enforced by the network, and it is dynamic in nature, so WebRTC will try to estimate it at all times and adapt to it
The maximum bitrate imposed by the participants
By its nature, WebRTC will not limit the amount of bandwidth it takes and will try to use as much as it possibly can. That said, even without any limits, the actual bitrate used will still depend on (1), (2) and the type of codec being used. It won't reach 50 Mbit/s...
For the most part, 2.5 Mbit/s will be enough for almost any type of content in WebRTC. 1080p will take up to about 4 Mbit/s and 4K probably around 15 Mbit/s.

How to make Media Foundation H.264 decoder work?

For some reason I'm not able to decode H.264.
The input/output configuration went well, just like input/output buffer creation.
I'm manually feeding the decoder with the H.264 demuxed from a live stream. Therefore, I use MFVideoFormat_H264_ES as media subtype. The decoding is very slow and the decoded frames are complete garbage. Other decoders are decoding the same stream properly.
Weird thing is that once ProcessInput() returns MF_E_NOTACCEPTING, the following ProcessOutput() returns MF_E_TRANSFORM_NEED_MORE_INPUT. According to MSDN, this should never happen.
Can anybody provide some concrete info on how to do it? (assuming that MF H.264 is functional, which I seriously doubt).
I'm willing to provide extra information, but I don't know what somebody might need in order to help.
Edit:
When exactly should I reset the number of bytes in the input buffer to zero?
Btw, I'm resetting the output buffer when ProcessOutput() delivers something (garbage).
Edit2:
Without resetting the current length of the input buffer to 0, I managed to get some semi-valid output. By semi-valid I mean that on every successful ProcessOutput() I receive a YUV image in which the current image contains a few more decoded macroblocks than the previous frame; the rest of the frame is black. Because I do not reset the length, this stops working after a while. So I guess there is a problem with resetting the buffer length, and I would expect to get some notification when the whole frame is done (or not).
Edit3:
While creating the input buffer, GetInputStreamInfo() returns 4096 as the input buffer size, with alignment 0. However, 4 KB is not enough. Increasing it to 4 MB helps in decompressing the frame, fragment by fragment. I still have to figure out whether there is a way to tell when the entire frame has been decoded.
When creating the input buffer, GetInputStreamInfo() returns 4096 as the buffer size, which is too small.
Setting the input buffer to 4 MB solved the problem. The buffer can probably be smaller; I still have to test that.
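For reference, here is a minimal sketch of the feed/drain pattern (not the exact code from this question). It assumes decoder is an already configured IMFTransform for the H.264 MFT with media types negotiated, au/auSize hold one demuxed access unit, and the MFT does not allocate its own output samples; error handling and most COM cleanup are trimmed.

#include <mfapi.h>
#include <mftransform.h>
#include <mferror.h>
#include <cstring>

HRESULT FeedAndDrain(IMFTransform* decoder, const BYTE* au, DWORD auSize)
{
    // Build an input sample sized for the whole access unit
    // (not the 4096-byte hint from GetInputStreamInfo()).
    IMFSample* inSample = nullptr;
    IMFMediaBuffer* inBuffer = nullptr;
    MFCreateSample(&inSample);
    MFCreateMemoryBuffer(auSize, &inBuffer);

    BYTE* dst = nullptr;
    inBuffer->Lock(&dst, nullptr, nullptr);
    memcpy(dst, au, auSize);
    inBuffer->Unlock();
    inBuffer->SetCurrentLength(auSize);   // tell the MFT how many bytes are valid
    inSample->AddBuffer(inBuffer);

    HRESULT hr = decoder->ProcessInput(0, inSample, 0);
    // MF_E_NOTACCEPTING here means: drain outputs first, then feed this sample again.

    // Drain every output that is ready; MF_E_TRANSFORM_NEED_MORE_INPUT is the normal exit.
    for (;;)
    {
        MFT_OUTPUT_STREAM_INFO osi = {};
        decoder->GetOutputStreamInfo(0, &osi);

        IMFSample* outSample = nullptr;
        IMFMediaBuffer* outBuffer = nullptr;
        MFCreateSample(&outSample);
        MFCreateMemoryBuffer(osi.cbSize, &outBuffer);
        outSample->AddBuffer(outBuffer);

        MFT_OUTPUT_DATA_BUFFER odb = {};
        odb.pSample = outSample;
        DWORD status = 0;
        hr = decoder->ProcessOutput(0, 1, &odb, &status);

        if (hr == MF_E_TRANSFORM_NEED_MORE_INPUT)
        {
            outBuffer->Release();
            outSample->Release();
            break;                        // feed the next access unit
        }
        if (hr == MF_E_TRANSFORM_STREAM_CHANGE)
        {
            // Renegotiate the output media type here, then loop again.
        }
        // ...hand odb.pSample (the decoded frame) to the consumer here...
        outBuffer->Release();
        outSample->Release();
    }

    inBuffer->Release();
    inSample->Release();
    return hr;
}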

RTP fragmentation vs UDP fragmentation

I don't understand why we bother fragmenting at the RTP level if the UDP (or IP) layer does the fragmentation anyway.
As I understand it, let's say we are on an Ethernet link with an MTU of 1500 bytes.
If I have to send, for example, 3880 bytes, fragmenting at the IP layer results in 3 packets of 1500, 1500, and 940 bytes respectively (each IP header is 20 bytes, so the total overhead is 60 bytes).
If I do it at the UDP layer, the overhead is 84 bytes (3 × 28 bytes).
At the RTP layer it's 120 bytes of overhead.
At the H.264/NAL packetization layer, it's 3 more bytes (so 123 bytes in total) for FU-A mode.
For such a small payload, that's a final increase of 3.1% over the initial packet size, while at the IP layer it would only waste 1.5% overall (the arithmetic is spelled out in the sketch below).
Is there any valid reason to bother with such complex packetization rules at the RTP layer, knowing it will always be worse than lower-layer fragmentation?
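Just to make the comparison above explicit, here is the arithmetic for the 3880-byte example (header sizes: IPv4 = 20 bytes, UDP = 8, RTP = 12; three packets in every case):

#include <cstdio>

int main()
{
    const int payload = 3880;                    // bytes to send
    const int parts   = 3;                       // 1500-byte MTU -> 3 packets
    const int ipOnly  = parts * 20;              // IP fragmentation: 60 bytes
    const int udp     = parts * (20 + 8);        // UDP per packet:   84 bytes
    const int rtp     = parts * (20 + 8 + 12);   // plus RTP header: 120 bytes
    std::printf("IP %d (%.1f%%)  UDP %d  RTP %d (%.1f%%)\n",
                ipOnly, 100.0 * ipOnly / payload,
                udp, rtp, 100.0 * rtp / payload);
}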
Except for the first fragment, fragmented IP traffic does not contain the source or destination port numbers; instead it glues packets together using the IP Identification field. This makes it impossible for stateless intermediate network devices (switches and routers) to re-install QoS markings (because the 802.1p or DSCP flags were cleared by another device or never existed in the first place). Unless the device has the resources to manage per-session state, it either has to risk rate-limiting/prioritizing fragments from unrelated streams, or not prioritize any fragments, some of which may carry voice/video.
AFAIK RTP packets never IP-fragment unless the network has MTU mismatches in it. Hence each UDP header has source and destination port numbers, so if you can tame your clients to use known port ranges, you can re-establish QoS markings based on this information, and you can pass IP fragments as vanilla traffic and not worry about dropping voice/video data.
RTP is designed with UDP in mind.
Applications typically run RTP on top of UDP to make use of its
multiplexing and checksum services; both protocols contribute parts of
the transport protocol functionality.
However, the services that RTP adds on top of raw UDP, such as the ability to detect packet reordering, losses and timing, require that the UDP payload carry not just the media data but also this service information (the RTP header).
The Internet, like other packet networks, occasionally loses and
reorders packets and delays them by variable amounts of time. To cope
with these impairments, the RTP header contains timing information
and a sequence number that allow the receivers to reconstruct the
timing produced by the source, so that in this example, chunks of
audio are contiguously played out the speaker every 20 ms. This
timing reconstruction is performed separately for each source of RTP
packets in the conference. The sequence number can also be used by
the receiver to estimate how many packets are being lost.
RTP is also designed to be extensible, with common headers and data-specific payload formats:
RTP is a protocol framework that is deliberately not complete. This document specifies those functions expected to be common across all the applications for which RTP would be appropriate. Unlike conventional protocols in which additional functions might be accommodated by making the protocol more general or by adding an option mechanism that would require
parsing, RTP is intended to be tailored through modifications and/or additions to the headers as needed.
All quotes are from RFC 1889 "RTP: A Transport Protocol for Real-Time Applications".
That is, the RTP overhead for an H.264 stream is not just a waste of bandwidth. RTP headers and the H.264 payload format make it possible, at moderate cost, to handle video streaming in a more reliable way, and at the same time to leverage a specification that is well defined and suitable for different sorts of data.
I'd like to add that a lot of RTP servers/senders go about sending split datagrams inefficiently.
They use a lot of malloc/free in dynamic buffer contexts.
They also use one syscall per part of the message instead of message-vectors.
To add insult to injury, they usually do a lot of time calculation / other handling between sending each part of the datagram.
This causes even more syscalls, sometimes even stretching the packet over a long period, because they have no upper bound on when the packet should be finished, only that it is finished before the next batch of packets is sent.
Inefficient behavior like this gets seriously in the way if you want to scale throughput, or on a low-power embedded CPU. For bandwidth, network and CPU efficiency reasons, it's usually far better to send the entire datagram to the kernel in one go and let it deal with fragmentation, instead of userspace trying to figure it out.
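To illustrate the message-vector point, here is a minimal POSIX sketch (assuming an already connected UDP socket fd; the function name is made up for illustration) that hands the RTP header and the payload to the kernel in a single sendmsg() call, with no intermediate copy and no second syscall:

#include <sys/socket.h>
#include <sys/uio.h>
#include <cstdint>
#include <cstddef>

// Send one RTP packet (header + payload) with a single syscall and no
// intermediate copy, by passing both pieces as an iovec to sendmsg().
ssize_t send_rtp_packet(int fd,
                        const uint8_t* rtp_header, size_t header_len,
                        const uint8_t* payload,    size_t payload_len)
{
    iovec iov[2];
    iov[0].iov_base = const_cast<uint8_t*>(rtp_header);
    iov[0].iov_len  = header_len;
    iov[1].iov_base = const_cast<uint8_t*>(payload);
    iov[1].iov_len  = payload_len;

    msghdr msg = {};
    msg.msg_iov    = iov;
    msg.msg_iovlen = 2;

    return sendmsg(fd, &msg, 0);   // the kernel assembles the datagram
}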
Well, after a lot of thinking about this, there is no reason not to use IP-based fragmentation up to 64 kB (and this will happen if you have many NAL units with the same timestamp that you need to aggregate, via STAP-A for example).
RFC 6184 is clear: you can carry up to 64 kB of NAL units this way, since each NAL unit is preceded by a 2-byte (16-bit) size field, although staying below the MTU is preferred.
What happens if the cumulative size of the "single-time" NAL units is larger than 64 kB? RFC 6184 does not say, but I guess you would have to send all your NAL units as separate FU-A packets without increasing the timestamp between them (this is the one case where the Start/End bits in the FU-A header are really useful, since there is no longer a 1:1 match between the End bit and the RTP marker bit).
The RFC states:
An aggregation packet can
carry as many aggregation units as necessary; however, the total
amount of data in an aggregation packet obviously MUST fit into an IP
packet, and the size SHOULD be chosen so that the resulting IP packet
is smaller than the MTU size
When a "single NAL per frame" is larger than the MTU (for example, 1460 bytes with Ethernet), it has to be split with a fragmentation unit packetization (for example, FU-A).
However, nothing in the RFC states that the limit should be 1460 bytes. And it makes sense to have larger than that when doing Ethernet only streaming (as computed above)
If you have a NAL unit larger than 64kB, then you must use FU-A to send it since you can not fit this in a single IP datagram.
The RFC states:
This payload type allows fragmenting a NAL unit into several RTP
packets. Doing so on the application layer instead of relying on
lower-layer fragmentation (e.g., by IP) has the following advantages:
o The payload format is capable of transporting NAL units bigger
than 64 kbytes over an IPv4 network that may be present in pre-
recorded video, particularly in High-Definition formats (there is
a limit of the number of slices per picture, which results in a
limit of NAL units per picture, which may result in big NAL
units).
o The fragmentation mechanism allows fragmenting a single NAL unit
and applying generic forward error correction as described in
Section 12.5.
Which I understand as: "If your NAL unit is smaller than 64 kB and you don't care about FEC, then don't use FU-A; use a single RTP packet for it."
Another case where FU-A is necessary is when receiving an H.264 stream with RTP over RTSP (interleaved mode). The "packet" size must fit in 2 bytes (16 bits), so you must also fragment larger NAL units even when sending over a reliable stream socket.
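To make the FU-A mechanics above concrete, here is a minimal sketch of RFC 6184 FU-A fragmentation (function name and container choice are mine; RTP header construction, MTU probing and the marker bit are assumed to be handled elsewhere):

#include <algorithm>
#include <cstdint>
#include <cstddef>
#include <vector>

// Split one NAL unit (without start code) into FU-A payloads per RFC 6184.
// maxPayload is the space available for the H.264 payload in one RTP packet.
std::vector<std::vector<uint8_t>> fragmentFUA(const uint8_t* nal, size_t nalSize,
                                              size_t maxPayload)
{
    std::vector<std::vector<uint8_t>> fragments;
    const uint8_t nalHeader   = nal[0];
    const uint8_t fuIndicator = (nalHeader & 0xE0) | 28;  // keep F+NRI bits, type 28 = FU-A

    size_t offset = 1;                          // the original NAL header byte is not repeated
    while (offset < nalSize)
    {
        const size_t chunk = std::min(maxPayload - 2, nalSize - offset);

        uint8_t fuHeader = nalHeader & 0x1F;    // original NAL unit type
        if (offset == 1)               fuHeader |= 0x80;   // S (start) bit
        if (offset + chunk == nalSize) fuHeader |= 0x40;   // E (end) bit

        std::vector<uint8_t> frag;
        frag.reserve(chunk + 2);
        frag.push_back(fuIndicator);
        frag.push_back(fuHeader);
        frag.insert(frag.end(), nal + offset, nal + offset + chunk);
        fragments.push_back(std::move(frag));

        offset += chunk;
    }
    // All fragments of one NAL unit share the same RTP timestamp; the RTP marker
    // bit is set only on the packet carrying the last fragment of the access unit.
    return fragments;
}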

HLS - how to reduce delay?

Does anyone know how to configure an HLS media server to reduce the delay of live streaming video a little?
What types of parameters do I need to change?
I had heard that you could do some tuning using parameters like HLSMediaFileDuration.
Thanks in advance
An HTTP Live Streaming system typically has an encoder, which produces segments of a certain number of seconds, and a media server (web server), which serves playlists containing a list of URLs to these segments to player applications.
Media Files = Segments = .ts files = MPEG2-TS files (in HLS speak)
There are some ways to reduce the delay in HLS:
Reduce the encoded segment length from Apple's recommended 10 seconds to 5 seconds or less. Reducing the segment length increases network overhead and the load on the web server.
Use lower bitrates; larger .ts files take longer to upload and download. If you use multi-bitrate streams, make sure the first bitrate listed in the playlist is a little lower than the bitrate most of your users use. This will reduce the time it takes to start playing back the stream.
Get the segments from the encoder to the web server faster. Upload while still encoding if possible, and update the playlist as soon as the segment has finished uploading.
Also remember that the higher the delay, the better the quality of your stream (low delay = lower quality). With larger segments there is less overhead, so more space for video data; taking longer to encode results in better quality; more buffering results in less chance of the video stream stuttering on playback.
HLS is all about accepting a longer delay in exchange for quality of playback, so you will never be able to use HLS for things like video conferencing. The typical delay in HLS is 30-60 seconds; the minimum in practice is around 15 seconds. If you want low delay, use RTP for streaming, but good luck getting good quality on low or variable speed networks.
Please specify which media server you use. Generally speaking, yes - changing the chunk size will definitely affect the delay. The smaller the first chunk, the quicker the video will be shown in the player.
Actually, Apple recommends dividing your file into small chunks of equal length, with integer durations.
In practice, there is a huge difference between players. Some of them parse the manifest and react to changes in these values.
A known practice is to pre-cache the first chunks in memory in low and medium resolution (or to download them in the background of the app/page - Amazon does this, though their video is MSS).
I was having the same problem and the keys for me were:
Lower the segment length. I set it to 2s because I'm streaming on a local network. On other types of networks, you need to be careful with the overhead that a short segment length adds, which can impact your playback quality.
In your manifest, make sure the #EXT-X-TARGETDURATION is accurate. From here:
The EXT-X-TARGETDURATION tag specifies the maximum Media Segment
duration. The EXTINF duration of each Media Segment in the Playlist
file, when rounded to the nearest integer, MUST be less than or equal
to the target duration; longer segments can trigger playback stalls
or other errors. It applies to the entire Playlist file.
For some reason, the #EXT-X-TARGETDURATION in my manifest was set to 5 and I was seeing a 16-20s delay. After changing that value to 2, which is the correct one according to my segments' length, I am now seeing delays of 6-10s.
In summary, you should expect a delay of at least 3x your #EXT-X-TARGETDURATION. So, lowering the segment length and the #EXT-X-TARGETDURATION value should help reduce the delay.
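For reference, a media playlist consistent with 2-second segments and a matching target duration might look like this (segment names are hypothetical):

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:2
#EXT-X-MEDIA-SEQUENCE:120
#EXTINF:2.000,
segment120.ts
#EXTINF:2.000,
segment121.ts
#EXTINF:2.000,
segment122.ts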