How to determine the last NAL of an Access Unit in H.264

When parsing NAL units from an H.264 source, is it possible to determine the end of an Access Unit without having to find the start of the next one? I am aware of the following section in the H.264 spec:
7.4.1.2.4 Detection of the first VCL NAL unit of a primary coded picture
I have currently implemented this. The problem, though, is that if there is a large time gap after the end of an Access Unit, I won't 'get' the Access Unit until the start of the next one arrives. Is there another way to determine the end (i.e. the last NAL) of an Access Unit?
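(For reference, a simplified form of that check looks roughly like the sketch below: a VCL NAL unit whose slice header starts with first_mb_in_slice equal to 0 begins a new coded picture. The full 7.4.1.2.4 test also compares frame_num, pic_parameter_set_id, field/IDR flags, and so on; emulation-prevention bytes are ignored here for brevity.)

```typescript
// Simplified sketch: does this NAL unit begin a new coded picture?
// Assumes `nal` starts at the 1-byte NAL header (no start code prefix).
function startsNewCodedPicture(nal: Uint8Array): boolean {
  const nalType = nal[0] & 0x1f;
  if (nalType < 1 || nalType > 5) return false; // only VCL NAL units count

  let bitPos = 8; // skip the NAL header byte
  const totalBits = nal.length * 8;
  const readBit = (): number => {
    const bit = (nal[bitPos >> 3] >> (7 - (bitPos & 7))) & 1;
    bitPos++;
    return bit;
  };

  // Decode first_mb_in_slice, the first Exp-Golomb (ue(v)) field of the slice header.
  let leadingZeros = 0;
  while (bitPos < totalBits && readBit() === 0) leadingZeros++;
  let firstMbInSlice = (1 << leadingZeros) - 1;
  for (let i = 0; i < leadingZeros && bitPos < totalBits; i++) {
    firstMbInSlice += readBit() << (leadingZeros - 1 - i);
  }

  // first_mb_in_slice == 0 means this slice is the first slice of a picture.
  return firstMbInSlice === 0;
}
```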
I am also aware of the marker bit in the RTP standard, but it is not reliable enough for us to use, and in some cases it is just plain wrong.

No, I don't think so.
The unreliable marker bit is the only way to signal the end of an access unit (in the case of RTP).
They should have handled it more reliably in the H.264 payload format (RFC 6184).
You can check RTP timestamps and sequence numbers to infer the start of a new AU, but that is also unreliable (packet loss, reordering, and you still need to wait for the first packet of the next AU).
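A minimal sketch of that heuristic (illustrative field names; packet loss and reordering are not handled):

```typescript
// Sketch: decide whether `current` is the last packet of its access unit.
// The marker bit is the intended (but unreliable) signal; the timestamp
// comparison only works once the first packet of the next AU has arrived.
interface RtpPacket {
  marker: boolean;
  timestamp: number;      // RTP timestamp (90 kHz for video)
  sequenceNumber: number;
}

function isEndOfAccessUnit(current: RtpPacket, next?: RtpPacket): boolean {
  if (current.marker) return true;             // trust the marker bit if set
  if (next === undefined) return false;        // can't tell yet
  return next.timestamp !== current.timestamp; // new timestamp => previous AU ended
}
```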

Related

How does chrome://webrtc-internals measure the round trip time?

I have been analyzing the JSON file generated using chrome://webrtc-internals while running WebRTC on two PCs.
I looked at the Stats API to verify how webrtc-internals computes the round trip time (RTT).
I found 2 ways:
RTC Remote Inbound RTP Video Stream that contains roundTripTime
RTC IceCandidate Pair that contains currentRoundTripTime.
Which one is accurate, why, and how is it computed?
Is RTT computed on a frame-by-frame basis?
Is it computed one way (sender --> receiver), or two ways (sender --> receiver--> sender)?
Which reports are used to measure the RTT? Is it Receiver Report RTCP or Sender Report RTCP?
What is the length of the GOP in the WebRTC VP8 codec?
RTCIceCandidatePairStats.currentRoundTripTime is computed from how long it takes the remote peer to respond to a STUN Binding Request. The WebRTC ICE agent sends these on an interval, and each message has a transaction ID.
RTCRemoteInboundRtpStreamStats.roundTripTime is computed via RTCP: the sender emits a Sender Report, and the remote peer's Receiver Report echoes that report's timestamp along with the delay before it replied, so the sender can derive the round trip time from those.
They are both accurate. Personally I use the ICE stats since there is less overhead: the packet doesn't have to be decrypted and routed through the RTCP subsystem. IMO ICE is also easier to deal with than RTCP.
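For reference, a minimal sketch of reading both values from getStats() (standard W3C stats fields; the `any` casts are only there because the extended stats dictionaries are not in the base TypeScript DOM typings):

```typescript
// Read both RTT estimates from a connected RTCPeerConnection.
async function logRoundTripTimes(pc: RTCPeerConnection): Promise<void> {
  const report = await pc.getStats();
  report.forEach((stat: any) => {
    // ICE-level RTT, measured from STUN Binding Request/Response exchanges.
    if (stat.type === 'candidate-pair' && stat.nominated) {
      console.log('ICE currentRoundTripTime (s):', stat.currentRoundTripTime);
    }
    // RTCP-level RTT, derived from Sender Report / Receiver Report timestamps.
    if (stat.type === 'remote-inbound-rtp' && stat.kind === 'video') {
      console.log('RTCP roundTripTime (s):', stat.roundTripTime);
    }
  });
}
```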
As for the length of the GOP in the WebRTC VP8 codec: it depends on what is being encoded and on the settings. Do you have a low keyframe interval? Are you encoding something with lots of changes? What are you trying to determine with this question?

How do you derive walltime from timestamp using Chrome's debugger protocol?

I've been building a Chrome extension using in part the Chrome debugger protocol.
Certain events in the Network domain like requestWillBeSent include a "timestamp" as well as a "wallTime."
The wallTime is a regular seconds-since-1970 value. The timestamp is also in seconds, but it's not clear where its zero point is, and many events have no wallTime, so I'm trying to figure out how to derive wallTime from timestamp.
Based on this, I believed it to be based on the navigationStart value, but that did not yield the correct date using either the extension background page's navigationStart or the navigationStart of the page where the event originated.
Is it possible at all to use timestamp to get at the wallTime or am I out of luck?
According to source code in InspectorNetworkAgent.cpp:
wallTime is currentTime() (normal system time)
timestamp is monotonicallyIncreasingTime()
On Windows it's based on the number of milliseconds that have elapsed since the system was started, and you can't get that info from an extension.
On POSIX systems (e.g. Linux) clock_gettime in CLOCK_MONOTONIC mode is used that represents monotonic time since some unspecified starting point.
According to source code in time.h:
TimeTicks and ThreadTicks represent an abstract time that is most of the time
incrementing, for use in measuring time durations. Internally, they are
represented in microseconds. They can not be converted to a human-readable
time, but are guaranteed not to decrease (unlike the Time class). Note that
TimeTicks may "stand still" (e.g., if the computer is suspended), and
ThreadTicks will "stand still" whenever the thread has been de-scheduled by
the operating system.
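Given that, one workable approach (a sketch, assuming the monotonic clock base stays constant for the debugging session, e.g. no suspend) is to calibrate an offset from any event that carries both fields, such as Network.requestWillBeSent, and apply it to events that only carry timestamp:

```typescript
// Chrome extension context, using chrome.debugger. Both fields are in seconds.
let wallClockOffset: number | undefined;

chrome.debugger.onEvent.addListener((_source, method, params: any) => {
  if (!params || typeof params.timestamp !== 'number') return;

  if (typeof params.wallTime === 'number') {
    // An event with both fields: they describe the same instant, so the
    // difference converts monotonic time to wall-clock time.
    wallClockOffset = params.wallTime - params.timestamp;
  } else if (wallClockOffset !== undefined) {
    const approxWallTime = params.timestamp + wallClockOffset;
    console.log(method, 'at ~', new Date(approxWallTime * 1000).toISOString());
  }
});
```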

RTP fragmentation vs UDP fragmentation

I don't understand why we bother fragmenting at the RTP level if the UDP (or IP) layer does the fragmentation.
As I understand it, let's say we are on an Ethernet link where the MTU is 1500 bytes.
If I have to send, for example, 3880 bytes, fragmenting at the IP layer would result in 3 packets of 1500, 1500, and 940 bytes respectively (the IP header is 20 bytes, so the total overhead comes to 60 bytes).
If I do it at the UDP layer, the overhead will be 84 bytes (3 × 28 bytes).
At the RTP layer it's 120 bytes of overhead.
At the H.264/NAL packetization layer, it's 3 more bytes (so 123 bytes in total) for FU-A mode.
For such a small payload, that is a final increase of 3.1% over the initial size, while at the IP layer it would only waste 1.5% overall.
Is there any valid reason to bother with such complex packetization rules at the RTP layer, knowing it will always be worse than lower-layer fragmentation?
Except for the first fragment, fragmented IP traffic does not contain the source or destination port numbers; instead it glues packets together using fragment IDs. This makes it impossible for stateless intermediate network devices (switches and routers) to re-apply QoS when the .1p or DSCP markings were cleared by another device or never existed in the first place. Unless the device has the resources to manage per-session state, it either has to risk rate-limiting/prioritizing fragments from unrelated streams, or not prioritize any fragments, some of which may carry voice/video.
AFAIK RTP packets never IP-fragment unless the network has MTU mismatches in it. Hence each packet carries a UDP header with source and destination port numbers, so if you can tame your clients to use known port ranges, you can re-establish QoS markings based on this information, and you can pass IP fragments as vanilla traffic without worrying about dropping voice/video data.
RTP is designed with UDP in mind.
Applications typically run RTP on top of UDP to make use of its
multiplexing and checksum services; both protocols contribute parts of
the transport protocol functionality.
However, the services RTP adds on top of raw UDP, such as the ability to detect packet reordering and loss and to reconstruct timing, require that the UDP payload carry not only the media data but also this service information (the RTP header).
The Internet, like other packet networks, occasionally loses and
reorders packets and delays them by variable amounts of time. To cope
with these impairments, the RTP header contains timing information
and a sequence number that allow the receivers to reconstruct the
timing produced by the source, so that in this example, chunks of
audio are contiguously played out the speaker every 20 ms. This
timing reconstruction is performed separately for each source of RTP
packets in the conference. The sequence number can also be used by
the receiver to estimate how many packets are being lost.
RTP is also designed to be extensible, with a common header plus data-specific payload formats:
RTP is a protocol framework that is deliberately not complete. This document specifies those functions expected to be common across all the applications for which RTP would be appropriate. Unlike conventional protocols in which additional functions might be accommodated by making the protocol more general or by adding an option mechanism that would require
parsing, RTP is intended to be tailored through modifications and/or additions to the headers as needed.
All quotes are from RFC 1889 "RTP: A Transport Protocol for Real-Time Applications".
That is, the RTP overhead for an H.264 stream is not just a waste of bandwidth. RTP headers and H.264 payload formatting make it possible, at moderate cost, to handle streamed video data more reliably, and at the same time to leverage a specification that is well defined and works for different kinds of data.
I'd like to add that a lot of RTP servers/senders go about sending split datagrams inefficiently.
They use a lot of malloc/free in dynamic buffer contexts.
They also use one syscall per part of the message instead of message-vectors.
To add insult to injury, they usually do a lot of time calculation / other handling between sending each part of the datagram.
This causes even more syscalls, sometimes even stretching the packet over a long time, because they have no upper bound on when the packet should be finished, only that it be finished before the next batch of packets is sent.
Inefficient behavior like this gets seriously in the way if you want to scale throughput or run on a low-power embedded CPU. For bandwidth, network, and CPU efficiency reasons, it's usually much better to hand the entire datagram to the kernel in one go and let it deal with fragmentation, instead of userspace trying to figure it out.
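For example, a Node.js sketch (with hypothetical, pre-built buffers) that hands the kernel the whole datagram in one call instead of one syscall per part:

```typescript
import dgram from 'node:dgram';

const sock = dgram.createSocket('udp4');

// rtpHeader and payload are assumed to be already-built Buffers.
function sendRtpPacket(rtpHeader: Buffer, payload: Buffer, port: number, host: string): void {
  // Passing an array of buffers lets Node assemble them into one datagram,
  // so the kernel receives the complete packet in a single send call.
  sock.send([rtpHeader, payload], port, host);
}
```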
Well, after a lot of thinking about this, there is no reason not to use IP-based fragmentation up to 64 kB (and this will happen if you have many NAL units with the same timestamp that you need to aggregate, via STAP-A for example).
RFC 6184 is clear: you can aggregate up to 64 kB of NAL units this way, since a size field of exactly 2 bytes (16 bits) is prepended to each NAL unit, although staying below the MTU is preferred.
What happens if the cumulative size of the "single-time" NAL units is larger than 64 kB? RFC 6184 does not say, but I guess you'll have to send all your NALs as separate FU-A packets without increasing the timestamp between them (this is the only case where the Start/End bits in the FU-A header are really useful, since there is no longer a 1:1 match between the End bit and the RTP marker bit).
The RFC states:
An aggregation packet can
carry as many aggregation units as necessary; however, the total
amount of data in an aggregation packet obviously MUST fit into an IP
packet, and the size SHOULD be chosen so that the resulting IP packet
is smaller than the MTU size
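As a concrete illustration of the 2-byte size field mentioned above, a minimal STAP-A builder sketch (type 24 per RFC 6184; F/NRI handling simplified):

```typescript
// Aggregate several NAL units (same timestamp) into one STAP-A payload:
// 1-byte STAP-A NAL header, then for each NAL a 16-bit big-endian size
// followed by the NAL unit itself.
function buildStapA(nalUnits: Uint8Array[]): Uint8Array {
  const total = 1 + nalUnits.reduce((n, nal) => n + 2 + nal.length, 0);
  const out = new Uint8Array(total);
  // NRI of the STAP-A should be the maximum NRI of the aggregated NAL units.
  const nri = Math.max(...nalUnits.map((nal) => nal[0] & 0x60));
  out[0] = nri | 24; // forbidden bit 0, type 24 = STAP-A
  let off = 1;
  for (const nal of nalUnits) {
    out[off] = (nal.length >> 8) & 0xff; // NALU size, network byte order
    out[off + 1] = nal.length & 0xff;
    out.set(nal, off + 2);
    off += 2 + nal.length;
  }
  return out;
}
```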
When a "single NAL per frame" is larger than the MTU (for example, 1460 bytes with Ethernet), it has to be split with a fragmentation unit packetization (for example, FU-A).
However, nothing in the RFC states that the limit should be 1460 bytes, and it makes sense to use larger packets than that when streaming over Ethernet only (as computed above).
If you have a NAL unit larger than 64 kB, then you must use FU-A to send it, since you cannot fit it in a single IP datagram.
The RFC states:
This payload type allows fragmenting a NAL unit into several RTP
packets. Doing so on the application layer instead of relying on
lower-layer fragmentation (e.g., by IP) has the following advantages:
o The payload format is capable of transporting NAL units bigger
than 64 kbytes over an IPv4 network that may be present in pre-
recorded video, particularly in High-Definition formats (there is
a limit of the number of slices per picture, which results in a
limit of NAL units per picture, which may result in big NAL
units).
o The fragmentation mechanism allows fragmenting a single NAL unit
and applying generic forward error correction as described in
Section 12.5.
Which I understand as: "If your NAL unit is less than 64 kbytes, and you don't care about FEC, then don't use FU-A; use a single RTP packet for it."
Another case where FU-A is necessary is when receiving an H.264 stream with RTP over RTSP (interleaved mode). The interleaved "packet" size must fit in 2 bytes (16 bits), so you must also fragment larger NAL units even when sending over a reliable stream socket.
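For completeness, a minimal FU-A fragmentation sketch (RFC 6184; maxPayload is whatever payload budget you have left after the RTP and transport headers):

```typescript
// Split one NAL unit into FU-A payloads. The original 1-byte NAL header is
// dropped; its F/NRI bits go into the FU indicator and its type into the
// FU header, from which the receiver reconstructs it.
function fragmentFuA(nal: Uint8Array, maxPayload: number): Uint8Array[] {
  const fuIndicator = (nal[0] & 0xe0) | 28; // F + NRI from the NAL, type 28 = FU-A
  const nalType = nal[0] & 0x1f;
  const body = nal.subarray(1);             // NAL payload without its header
  const packets: Uint8Array[] = [];
  for (let off = 0; off < body.length; off += maxPayload) {
    const chunk = body.subarray(off, Math.min(off + maxPayload, body.length));
    let fuHeader = nalType;
    if (off === 0) fuHeader |= 0x80;                          // S (start) bit
    if (off + chunk.length === body.length) fuHeader |= 0x40; // E (end) bit
    const pkt = new Uint8Array(2 + chunk.length);
    pkt[0] = fuIndicator;
    pkt[1] = fuHeader;
    pkt.set(chunk, 2);
    packets.push(pkt);
  }
  return packets;
}
```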

Is it possible to remove start codes using NVENC?

I'm using the NVENC SDK to encode OpenGL frames and stream them over RTSP. NVENC gives me encoded data in the form of several NAL units. In order to stream them with Live555 I need to find the start code (0x00 0x00 0x01) and remove it. I want to avoid this operation.
NVENC has a sliceOffset attribute which I can consult, but it indicates slices, not NAL units. It only points to the end of the SPS and PPS headers, where the actual data starts. I understand that a slice is not equal to a NAL unit (correct me if I'm wrong). I'm already forcing single slices for encoded data.
Is any of the following possible?
Force NVENC to encode individual NAL units
Force NVENC to indicate where the NAL units in each encoded data block are
Make Live555 accept the sequence parameters for streaming
There seems to be a point where every person trying to do H.264 over RTSP/RTP comes down to this question. Well here are my two cents:
1) There is the concept of an access unit. An access unit is a set of NAL units (it may well be only one) that represents an encoded frame. That is the level of logic you should work at. If you say you want the encoder to give you individual NAL units, then what behavior do you expect when the encoding procedure produces multiple NAL units from one raw frame (e.g. SPS + PPS + coded picture)? That being said, there are ways to configure the encoder to reduce the number of NAL units in an access unit (such as not including the AUD NAL, not repeating SPS/PPS, and excluding SEI NALs). With that knowledge you know what to expect and can more or less force the encoder to give you a single NAL per frame (of course this will not work for all frames, but with what you know about the encoder configuration you can handle those cases). I'm not an expert on the NVENC API, I've also just started using it, but at least with Intel Quick Sync, turning off AUD and SEI and disabling repetition of PPS/SPS gave me roughly one NAL per frame for frames 2...N.
2) I won't be able to answer this since, as I mentioned, I'm not familiar with the API, but I highly doubt it.
3) SPS and PPS should be in the first access unit (the first bit-stream chunk you get from the encoder); you could just find the right NALs in the bit-stream and extract them, or there may be a dedicated API call to obtain them from the encoder.
All that being said, I don't think it is that hard to run through the bit-stream, parse the start codes, extract the NAL units, and feed them to Live555 one by one. Of course, if the encoder can output the bit-stream in AVCC format (which, unlike Annex B start codes, interleaves a length value between the NAL units, so you can jump straight to the next one without scanning for a prefix), you should use it. When it is just RTP, it's easy enough to implement the transport yourself (I had bad luck with GStreamer, which did not have proper support for FU-A packetization); in the case of RTSP the overhead of the transport infrastructure is bigger, and it is reasonable to use a third-party library like Live555.
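As a rough illustration of how little code the start-code scan takes, a sketch that splits an Annex B buffer on 3- or 4-byte start codes and returns the bare NAL units:

```typescript
// Split an Annex B byte stream into NAL units, dropping the start codes.
function splitAnnexB(buf: Uint8Array): Uint8Array[] {
  const nals: Uint8Array[] = [];
  let start = -1;
  for (let i = 0; i + 2 < buf.length; i++) {
    if (buf[i] === 0 && buf[i + 1] === 0 && buf[i + 2] === 1) {
      // A zero byte right before 00 00 01 is treated as a 4-byte start code.
      const scStart = i > 0 && buf[i - 1] === 0 ? i - 1 : i;
      if (start >= 0) nals.push(buf.subarray(start, scStart));
      start = i + 3; // first byte after the start code
      i += 2;
    }
  }
  if (start >= 0) nals.push(buf.subarray(start));
  return nals;
}
```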

Flex/Actionscript determine if NetStream has audio by analysing audioBytesPerSecond

I need to "look" at a NetStream and determine if I'm receiving audio. From what I investigated, i may use the property audioBytesPerSecond from NetStreamInfo:
"(audioBytesPerSecond) Specifies the rate at which the NetStream audio
buffer is filled in bytes per second. The value is calculated as a
smooth average for the audio data received in the last second."
I also learned that the NetStream may contain some overhead bytes from the network, so what is the minimum reasonable audioBytesPerSecond value to determine whether the NetStream is playing audio (and not just noise, for example)?
Can this analysis be done this way?
Thanks in advance!
Yes, you can do it this way. It's rather subjective, however.
Try to find a threshold that works for you. We used 5 kilobits/sec in the past: if the amount of data falls below this value, they are likely not sending any audio. Note that we were using the stream.info.byteCount property (you might want a slightly lower value if you're using audioBytesPerSecond).
This is pretty easy to observe if you speak into the microphone and periodically check audioBytesPerSecond or the other counters/statistics that are available.
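As a language-agnostic sketch of that threshold check (written in TypeScript here, with a hypothetical accessor standing in for NetStream.info.audioBytesPerSecond; 5 kilobits/sec works out to 5000 / 8 = 625 bytes/sec):

```typescript
// Threshold below which we assume the stream is not carrying audio.
const AUDIO_THRESHOLD_BYTES_PER_SEC = 5000 / 8; // ~5 kbit/s expressed in bytes

function seemsToHaveAudio(getAudioBytesPerSecond: () => number): boolean {
  // audioBytesPerSecond is already a one-second smoothed average, so a single
  // reading compared against the threshold is usually enough.
  return getAudioBytesPerSecond() >= AUDIO_THRESHOLD_BYTES_PER_SEC;
}
```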