RTP fragmentation vs UDP fragmentation - H.264

I don't understand why we bother fragmenting at RTP level if UDP (or IP) layer does the fragmentation.
As I understand it, let's say we are on Ethernet link, the MTU is 1500 bytes.
If I have to send, for example, 3880 bytes, fragmenting at the IP layer would result in 3 packets of respectively 1500, 1500, and 940 bytes (the IP header is 20 bytes, so the total overhead is 60 bytes).
If I do it at the UDP layer, the overhead will be 84 bytes (3 x 28 bytes).
At the RTP layer it's 120 bytes of overhead.
At the H.264/NAL packetization layer, it's 3 more bytes (so 123 bytes in total) for FU-A mode.
For such a small payload, that is a final increase of 3.1% over the initial size, while fragmenting at the IP layer would only waste 1.5% overall.
Is there any valid reason to bother with such complex packetization rules at the RTP layer, knowing it will always be worse than lower-layer fragmentation?
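To make the comparison concrete, here is a rough sketch of that arithmetic in C, assuming an Ethernet MTU of 1500 bytes and 20-byte IPv4, 8-byte UDP and 12-byte RTP headers (the extra FU-A bytes are left out); the numbers are only illustrative:

    /* Back-of-the-envelope per-layer overhead for a 3880-byte payload. */
    #include <stdio.h>

    int main(void) {
        const int payload = 3880, mtu = 1500;
        const int ip = 20, udp = 8, rtp = 12;

        /* IP fragmentation: each fragment carries its own 20-byte IP header. */
        int frag_payload = mtu - ip;                              /* 1480 B */
        int nfrags = (payload + frag_payload - 1) / frag_payload; /* 3 */
        printf("IP fragments: %d, overhead: %d B (%.1f%%)\n",
               nfrags, nfrags * ip, 100.0 * nfrags * ip / payload);

        /* Application-level split into separate RTP packets: each packet
         * carries IP + UDP + RTP headers. */
        int per_pkt = ip + udp + rtp;                             /* 40 B */
        printf("RTP packets:  %d, overhead: %d B (%.1f%%)\n",
               nfrags, nfrags * per_pkt, 100.0 * nfrags * per_pkt / payload);
        return 0;
    }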

Except for the first fragment, fragmented IP traffic does not contain the source or destination port numbers; instead, the fragments are glued together using IP identification/sequence IDs. This makes it impossible for stateless intermediate network devices (switches and routers) to re-install QoS (because the 802.1p or DSCP markings were cleared by another device, or never existed in the first place). Unless the device has the resources to manage per-session state, it either has to risk rate-limiting/prioritizing fragments from unrelated streams, or has to avoid prioritizing any fragments, some of which can be voice/video.
AFAIK RTP packets never IP-fragment unless the network has MTU mismatches in it. Hence each UDP header has source and destination port numbers, so if you can tame your clients to use known port ranges, you can re-establish QoS markings based on this information, and you can pass IP fragments as vanilla traffic and not worry about dropping voice/video data.
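If the markings never existed in the first place, one option is for the sender to mark its own traffic. A minimal sketch, assuming a POSIX/Linux IPv4 UDP socket and that your network honors DSCP EF (46); error handling trimmed:

    #include <netinet/in.h>
    #include <sys/socket.h>

    /* Mark an RTP sender's socket with DSCP EF so intermediate devices
     * don't have to re-classify the traffic themselves. */
    int mark_rtp_socket(int fd) {
        int tos = 46 << 2;   /* DSCP EF in the upper 6 bits of the TOS byte */
        return setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
    }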

RTP is designed with UDP in mind.
Applications typically run RTP on top of UDP to make use of its multiplexing and checksum services; both protocols contribute parts of the transport protocol functionality.
However, the services RTP adds on top of raw UDP, such as the ability to detect packet reordering and loss and to recover timing, require that the UDP payload carry not only the media data but also service information.
The Internet, like other packet networks, occasionally loses and reorders packets and delays them by variable amounts of time. To cope with these impairments, the RTP header contains timing information and a sequence number that allow the receivers to reconstruct the timing produced by the source, so that in this example, chunks of audio are contiguously played out the speaker every 20 ms. This timing reconstruction is performed separately for each source of RTP packets in the conference. The sequence number can also be used by the receiver to estimate how many packets are being lost.
RTP is also designed to be extensible, with a common header and data-specific payload formats:
RTP is a protocol framework that is deliberately not complete. This document specifies those functions expected to be common across all the applications for which RTP would be appropriate. Unlike conventional protocols in which additional functions might be accommodated by making the protocol more general or by adding an option mechanism that would require parsing, RTP is intended to be tailored through modifications and/or additions to the headers as needed.
All quotes are from RFC 1889 "RTP: A Transport Protocol for Real-Time Applications".
That is, the RTP overhead for an H.264 stream is not just wasted bandwidth. RTP headers and H.264 payload formatting make it possible, at moderate cost, to handle streamed video more reliably, while at the same time building on a well-defined specification that works for many kinds of data.
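To make the "service information" concrete, here is a sketch of the 12-byte fixed RTP header described in the RFC, plus an illustrative parsing helper (my own, not from the RFC) that pulls out the sequence number and timestamp a receiver uses for reordering, loss detection and timing reconstruction:

    #include <stdint.h>

    /* The 12-byte fixed RTP header (RFC 1889 / RFC 3550), network byte order. */
    struct rtp_header {
        uint8_t  vpxcc;      /* version (2 bits), padding, extension, CSRC count */
        uint8_t  mpt;        /* marker (1 bit), payload type (7 bits)            */
        uint16_t seq;        /* sequence number: detects loss and reordering     */
        uint32_t timestamp;  /* media timestamp: reconstructs playout timing     */
        uint32_t ssrc;       /* synchronization source identifier                */
    };

    /* Illustrative helper: read sequence number and timestamp from a packet. */
    static void parse_rtp(const uint8_t *p, uint16_t *seq, uint32_t *ts) {
        *seq = (uint16_t)((p[2] << 8) | p[3]);
        *ts  = ((uint32_t)p[4] << 24) | ((uint32_t)p[5] << 16) |
               ((uint32_t)p[6] << 8)  |  (uint32_t)p[7];
    }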

I'd like to add that a lot of RTP servers/senders go about sending split datagrams inefficiently.
They use a lot of malloc/free in dynamic buffer contexts.
They also use one syscall per part of the message instead of message-vectors.
To add insult to injury they usually do a lot of time calculation / other handling between sending every part of the datagram.
This causes even more syscalls, and sometimes stretches the datagram out over a long time, because they have no upper bound on when the datagram should be finished, only that it is finished before sending the next batch of packets.
Inefficient behavior like this gets seriously in the way if you want to scale throughput or run on a low-power embedded CPU. For bandwidth, network, and CPU efficiency reasons, it's usually much better to send the entire datagram to the kernel in one go and let it deal with fragmentation, instead of userspace trying to figure it out.
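As a rough sketch of the message-vector approach (assuming a connected UDP socket on a POSIX system; the function name is mine), the header and payload can live in separate buffers and still reach the kernel in a single syscall:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Send an RTP header and its payload with one syscall instead of two,
     * letting the kernel assemble (and, if needed, fragment) the datagram. */
    ssize_t send_rtp(int fd, const uint8_t *hdr, size_t hdr_len,
                     const uint8_t *payload, size_t payload_len) {
        struct iovec iov[2] = {
            { .iov_base = (void *)hdr,     .iov_len = hdr_len     },
            { .iov_base = (void *)payload, .iov_len = payload_len },
        };
        struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 2 };
        return sendmsg(fd, &msg, 0);   /* connected UDP socket assumed */
    }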

Well, after a lot of thinking about this, there is no reason not to use IP-based fragmentation up to 64 kB (and this will happen if you have many NAL units with the same timestamp that you need to aggregate, via STAP-A for example).
RFC 6184 is clear: you can use up to 64 kB of NAL units this way, since each NAL unit is preceded by its size as exactly 2 bytes (16 bits), although staying below the MTU is preferred.
What happens if the cumulated size of the single-time NAL units is larger than 64 kB? RFC 6184 does not say, but I guess you'll have to send all your NAL units as separate FU-A packets without increasing the timestamp between them (this is the only case where the Start/End bits in the FU-A header are really useful, since there is no longer a 1:1 match between the End bit and the RTP marker bit).
The RFC states:
An aggregation packet can carry as many aggregation units as necessary; however, the total amount of data in an aggregation packet obviously MUST fit into an IP packet, and the size SHOULD be chosen so that the resulting IP packet is smaller than the MTU size.
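To make the aggregation layout concrete, here is a sketch of how a STAP-A payload is built per RFC 6184: each aggregated NAL unit is preceded by its 16-bit, big-endian size. The helper name and bounds check are my own:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Append one NAL unit to a STAP-A payload: 16-bit big-endian size, then
     * the NAL unit itself. Returns the new write offset, or 0 if it would not
     * fit. The caller is responsible for writing the one-byte STAP-A NAL
     * header (type 24) at buf[0] first. */
    size_t stap_a_append(uint8_t *buf, size_t cap, size_t off,
                         const uint8_t *nal, uint16_t nal_len) {
        if (off + 2 + (size_t)nal_len > cap)
            return 0;                           /* exceeds the packet budget */
        buf[off]     = (uint8_t)(nal_len >> 8); /* size, network byte order  */
        buf[off + 1] = (uint8_t)(nal_len & 0xff);
        memcpy(buf + off + 2, nal, nal_len);
        return off + 2 + nal_len;
    }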
When a "single NAL per frame" is larger than the MTU (for example, 1460 bytes with Ethernet), it has to be split with a fragmentation unit packetization (for example, FU-A).
However, nothing in the RFC states that the limit should be 1460 bytes, and it makes sense to use packets larger than that when streaming over Ethernet only (as computed above).
If you have a NAL unit larger than 64kB, then you must use FU-A to send it since you can not fit this in a single IP datagram.
The RFC states:
This payload type allows fragmenting a NAL unit into several RTP packets. Doing so on the application layer instead of relying on lower-layer fragmentation (e.g., by IP) has the following advantages:
o The payload format is capable of transporting NAL units bigger than 64 kbytes over an IPv4 network that may be present in pre-recorded video, particularly in High-Definition formats (there is a limit of the number of slices per picture, which results in a limit of NAL units per picture, which may result in big NAL units).
o The fragmentation mechanism allows fragmenting a single NAL unit and applying generic forward error correction as described in Section 12.5.
Which I understand as: "If your NAL unit is less than 64 kbytes and you don't care about FEC, then don't use FU-A; use a single RTP packet for it."
Another case where FU-A is necessary is when receiving an H.264 stream with RTP over RTSP (interleaved mode). The "packet" size must fit in 2 bytes (16 bits), so you also must fragment larger NAL units even when they are sent on a reliable stream socket.
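For completeness, a sketch of FU-A fragmentation as I read RFC 6184 (the callback, buffer size and fragment-size parameter are assumptions of the sketch, not part of the RFC): the original one-byte NAL header is not sent as-is; its F/NRI bits go into the FU indicator, its type into the FU header, and the Start/End bits mark the first and last fragment:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Called once per fragment with the FU-A payload (FU indicator + FU
     * header + data) ready to be wrapped in an RTP packet. */
    typedef void (*emit_cb)(const uint8_t *frag, size_t len, int is_last);

    /* Split one NAL unit into FU-A fragments of at most max_frag data bytes.
     * Assumes nal_len >= 2 and max_frag <= sizeof(buf) - 2. */
    void fu_a_fragment(const uint8_t *nal, size_t nal_len,
                       size_t max_frag, emit_cb emit) {
        if (nal_len < 2)
            return;
        uint8_t indicator = (uint8_t)((nal[0] & 0xE0) | 28); /* F+NRI, type 28 */
        uint8_t type      = (uint8_t)(nal[0] & 0x1F);        /* original type  */
        const uint8_t *p  = nal + 1;                         /* skip NAL header */
        size_t left = nal_len - 1;
        int first = 1;
        uint8_t buf[1500];

        while (left > 0) {
            size_t chunk = left < max_frag ? left : max_frag;
            int last = (chunk == left);
            buf[0] = indicator;                              /* FU indicator */
            buf[1] = (uint8_t)((first ? 0x80 : 0) |          /* Start bit    */
                               (last  ? 0x40 : 0) | type);   /* End bit, type */
            memcpy(buf + 2, p, chunk);
            emit(buf, chunk + 2, last);
            p += chunk; left -= chunk; first = 0;
        }
    }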

Related

How does chrome://webrtc-internals measure the round trip time?

I have been analyzing the JSON file generated by chrome://webrtc-internals while running WebRTC on 2 PCs.
I looked at the Stats API to verify how webrtc-internals computes the round trip time (RTT).
I found 2 ways:
RTC Remote Inbound RTP Video Stream that contains roundTripTime
RTC IceCandidate Pair that contains currentRoundTripTime.
Which one is accurate, why, and how is it computed?
Is RTT computed on a frame-by-frame basis?
Is it computed one way (sender --> receiver), or two ways (sender --> receiver --> sender)?
Which reports are used to measure the RTT? Is it Receiver Report RTCP or Sender Report RTCP?
What is the length of the GOP in the WebRTC VP8 codec?
RTCIceCandidatePairStats.currentRoundTripTime is computed from how long it takes the remote peer to respond to a STUN Binding Request. The WebRTC ICE Agent sends these on an interval, and each message has a TransactionID.
RTCRemoteInboundRtpStreamStats.roundTripTime is computed from RTCP: the remote peer reports how long it has been since the last Sender Report it received, and since the sender knows when that report was sent, it can compute how long the round trip took.
They are both accurate. Personally I use the ICE stats since there is less overhead: the packet doesn't have to be decrypted and routed through the RTCP subsystem. IMO ICE is also easier to deal with than RTCP.
As for the length of the GOP in the WebRTC VP8 codec: it depends on what is being encoded and the settings. Do you have a low keyframe interval? Are you encoding something with lots of changes? What are you trying to determine with this question?
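For reference, the RTCP-based figure boils down to the RFC 3550 formula: when the report comes back, the sender subtracts the echoed last-SR timestamp (LSR) and the delay-since-last-SR (DLSR) from the arrival time, all in 1/65536-second units. A small sketch:

    #include <stdint.h>

    /* RTCP round-trip time per RFC 3550, section 6.4.1. All three values are
     * in the "middle 32 bits" NTP format, i.e. units of 1/65536 of a second.
     * arrival: when the Receiver Report reached the sender
     * lsr:     "last SR timestamp" echoed back by the receiver
     * dlsr:    delay at the receiver between getting that SR and replying */
    static double rtcp_rtt_seconds(uint32_t arrival, uint32_t lsr, uint32_t dlsr) {
        uint32_t rtt = arrival - lsr - dlsr;   /* wraps correctly in uint32 */
        return rtt / 65536.0;
    }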

Understanding Websocket Frames in Chrome

When inspecting Websocket frames via Chromes' debug console, is the length field measuring the payload in bytes?
Obviously, it's the length of the message. But each character is one byte, right? If that is true, is it safe to say that on my screenshot 56 and 53 bytes were sent?
Yes, the length reported in Chrome is the length of the payload in bytes.
There is some additional overhead in the message itself beyond just what the payload length reports (both webSocket frame overhead and TCP/IP overhead, though it is fairly efficient in overhead). You can see the webSocket frame format here.
In your screenshot, 53 and 56 bytes of message payload were sent, but something a little larger than that went over the actual wire. You could count the characters in the data it reports was sent and that length should match the reported length. Keep in mind that TCP is a reliable protocol, so there is extra TCP/IP protocol overhead related to the reliable delivery of any packet, including ACKs sent back to confirm delivery, unique packet numbers, etc., but that extra data is relatively small.
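If you want to estimate the on-the-wire size yourself, here is a rough sketch of the RFC 6455 framing overhead, ignoring TCP/IP and assuming the usual client-side masking:

    #include <stddef.h>

    /* Bytes on the wire for one WebSocket frame (RFC 6455), excluding TCP/IP.
     * masked is non-zero for client-to-server frames (browsers always mask). */
    size_t ws_frame_size(size_t payload_len, int masked) {
        size_t hdr = 2;                          /* FIN/opcode + mask bit/len */
        if (payload_len > 65535)      hdr += 8;  /* 64-bit extended length    */
        else if (payload_len >= 126)  hdr += 2;  /* 16-bit extended length    */
        if (masked)                   hdr += 4;  /* masking key               */
        return hdr + payload_len;
    }
    /* e.g. ws_frame_size(56, 1) == 62: the 56-byte payload Chrome reports,
     * plus 2 header bytes and a 4-byte mask, before TCP/IP overhead. */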

Does libpcap always make a copy of the packet?

I am writing a monitoring program for a very high traffic network (HD videos are streamed through the network). Most packets are very large and I only want to watch the headers (IP and UDP/TCP only). Of course I want to avoid the overhead of copying the entire data. Does libpcap necessarily give me a copy of the whole packet? If yes, is there any library that matches my needs?
There appear to be two questions here:
the one in the title, which sounds as if it's asking whether libpcap copies the packet;
the one in the body, asking whether it always copies the entire packet.
For the first question:
There's probably at least one copy done by any code using the mechanisms atop which libpcap runs in various OSes - a copy from the mbufs/skbuff/STREAMS buffers/whatever to the mechanism's buffer. For Linux, when the tpacket mechanism is not being used, the skbuff might just be queued on the receive queue for the PF_PACKET socket libpcap is using.
There may be another copy - a copy from that buffer to userland; if libpcap is using a "zero-copy" mechanism, such as the Linux tpacket mechanism (which libpcap 1.0 and later use by default), the second copy doesn't happen. It will happen if a zero-copy mechanism isn't being used.
However, if you're using pcap_next() or pcap_next_ex() on a Linux system and the tpacket mechanism is being used, a separate copy is made, from the memory-mapped buffer to a private buffer; that doesn't happen if you use pcap_dispatch() or pcap_loop().
For the second question:
That's what the "snaplen" argument to pcap_open_live() and pcap_set_snaplen() is for - it lets you specify that no more than "snaplen" bytes of packet data should be captured, and that means that no more than that many bytes are copied.
Note that this length includes the link-layer headers, and that those can include "metadata" headers such as radiotap headers that you might get on 802.11 adapters. These headers might be variable-length (for example, on 802.11, the 802.11 header is variable-length, and, if you're getting radiotap headers, those are variable-length as well).
In addition, both IPv4 and TCP headers can have options, and IPv6 packets can have extension headers, so the length of IP and TCP headers can also be variable.
This means that you might have to determine a "worst case" snapshot length to use; there's no way to explicitly say "don't give me anything past the TCP/UDP header", you can only say "give me no more than N bytes".
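As a concrete illustration, here is a minimal capture setup with a small snapshot length; the 256-byte figure is an arbitrary "worst case" guess meant to cover link-layer, IP/IPv6 and TCP/UDP headers with options, and "eth0" is just a placeholder device:

    #include <pcap/pcap.h>
    #include <stdio.h>

    int main(void) {
        char errbuf[PCAP_ERRBUF_SIZE];

        pcap_t *p = pcap_create("eth0", errbuf);
        if (!p) { fprintf(stderr, "%s\n", errbuf); return 1; }

        pcap_set_snaplen(p, 256);     /* copy at most 256 bytes per packet */
        pcap_set_promisc(p, 1);
        pcap_set_timeout(p, 1000);    /* read timeout, ms */
        if (pcap_activate(p) < 0) { pcap_perror(p, "pcap_activate"); return 1; }

        /* ... pcap_loop() / pcap_dispatch() as usual ... */
        pcap_close(p);
        return 0;
    }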

TCP Slow Start, Congestion Avoidance & Determining Bandwidth

Is there a formula someplace which can be used to determine the minimum number of segments / bytes which need to be transferred across a TCP connection to determine its bandwidth, and which takes into account Slow Start and Congestion Avoidance? I'm aware of the pathrate tool, but I want, if possible, something a bit simpler that I can incorporate in an app to get a decent ballpark figure. One example of usage would be downloading some data from a webserver in order to determine the optimum number of threads for downloading a bunch of small files automatically. This is related to a previous question I posted: TCP, HTTP and the Multi-Threading Sweet Spot
You can fire up scholar.google.com and search for "TCP chirp". However, that requires high-resolution timers, and if you don't write a kernel TCP congestion control algorithm, you'd have to reimplement TCP in userspace. That by itself will probably not give good results (general-purpose OSes are not very good at real-time, high-resolution timer work running in userspace).
In theory, using TCP chirp you need as few as 4-5 segments (typically, you'd get better resolution with a longer train of segments) to determine the "optimal" bandwidth.
In any case, since you cannot know which path is used (i.e. a satellite link or TV broadcast in the forward direction), you may need a considerable amount of data (10+ MB, perhaps even 1 GB) to get a decent measurement over arbitrary paths. (Satellites can have many dozens of MB/s of bandwidth, but also latencies in the 1000-3000 ms range, and TCP takes a couple of round-trip times to open up cwnd; I'd say around 10 RTTs before a measurement should be started.)
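Purely as a back-of-the-envelope sketch (not a formula from any standard), assuming classical slow start that doubles cwnd every RTT from an initial window of 10 segments, you can estimate how many RTTs and bytes are burned before cwnd reaches the bandwidth-delay product, i.e. before a throughput measurement starts to mean anything; all the constants below are assumed example values:

    #include <stdio.h>

    /* Bytes and RTTs spent in classical slow start before cwnd reaches the
     * bandwidth-delay product. Assumes cwnd doubles every RTT, no loss,
     * MSS-sized segments. */
    int main(void) {
        const double bw_bps = 10e6;        /* assumed path bandwidth, bit/s */
        const double rtt_s  = 0.2;         /* assumed round-trip time, s    */
        const int    mss    = 1460;        /* assumed segment payload size  */
        const int    init_segs = 10;       /* common initial window (IW10)  */

        double bdp_segs = bw_bps / 8.0 * rtt_s / mss;  /* target cwnd, segs */
        double sent = 0;
        int rtts = 0;
        for (double cwnd = init_segs; cwnd < bdp_segs; cwnd *= 2) {
            sent += cwnd * mss;            /* bytes sent during this RTT    */
            rtts++;
        }
        printf("~%d RTTs (%.1f s) and ~%.0f kB before cwnd reaches the BDP\n",
               rtts, rtts * rtt_s, sent / 1000.0);
        return 0;
    }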
I do not think that there is a fixed number of bytes required to be sent to determine the bandwidth. This number can depend on network type and speed.
Bandwidth is a measure of some resource transferred over a time interval. To get real data you need to measure it. Here are some hints on how to do that.

Are there any protocols/standards on top of TCP optimized for high throughput and low latency?

Are there any protocols/standards that work over TCP that are optimized for high throughput and low latency?
The only one I can think of is FAST.
At the moment I have devised just a simple text-based protocol delimited by special characters. I'd like to adopt a protocol which is designed for fast transfer and supports perhaps compression and minification of the data that travels over the TCP socket.
Instead of using heavyweight TCP, we can build TCP's connection-oriented/reliable behaviour on top of UDP in either of the following ways:
UDP-based Data Transfer Protocol (UDT):
UDT is built on top of User Datagram Protocol (UDP) by adding congestion control and reliability control mechanisms. UDT is an application level, connection oriented, duplex protocol that supports both reliable data streaming and partial reliable messaging.
Acknowledgment:
UDT uses periodic acknowledgments (ACK) to confirm packet delivery, while negative ACKs (loss reports) are used to report packet loss. Periodic ACKs help to reduce control traffic on the reverse path when the data transfer speed is high, because in these situations, the number of ACKs is proportional to time, rather than the number of data packets.
Reliable User Datagram Protocol (RUDP):
It aims to provide a solution where UDP is too primitive because guaranteed-order packet delivery is desirable, but TCP adds too much complexity/overhead.
It extends UDP by adding the following additional features:
Acknowledgment of received packets
Windowing and congestion control
Retransmission of lost packets
Overbuffering (Faster than real-time streaming)
en.wikipedia.org/wiki/Reliable_User_Datagram_Protocol
If layered on top of TCP, you won't get better throughput or latency than the 'barest' TCP connection.
There are other non-TCP high-throughput and/or low-latency connection-oriented protocols, usually layered on top of UDP.
Almost the only one I know is UDT, which is optimized for networks where high bandwidth or long round-trip times (RTT) make typical TCP retransmission behavior suboptimal. These are called 'long fat networks' (LFN, pronounced 'elefan').
You may want to consider JMS. JMS can run on top of TCP, and you can get reasonable latency with a message broker like ActiveMQ.
It really depends on your target audience though. If you're building a game which must run anywhere, you pretty much need to use HTTP or HTTP/Streaming. If you are pushing around market data on a LAN, then something NOT using TCP would probably suit you better. Tibco RV and JGroups both provide reliable low-latency messaging over multicast.
Just as you mentioned, FAST is intended for market data distribution, is used by leading stock exchanges, and runs on top of UDP multicast.
In general, with the current level of network reliability, it is always worth putting your protocol on top of UDP.
Anything with a session sequence number, NACK + server-to-client heartbeat, and binary marshalling should be close to theoretical performance.
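Just to illustrate that last point (not taken from any standard), the per-packet header of such a protocol could be as small as this, with gaps in the sequence numbers triggering a NACK and a dedicated heartbeat type for server-to-client liveness:

    #include <stdint.h>

    /* Illustrative wire header for a bespoke NACK-based protocol over UDP.
     * All fields in network byte order; the layout is an example only. */
    enum msg_type { MSG_DATA = 0, MSG_NACK = 1, MSG_HEARTBEAT = 2 };

    #pragma pack(push, 1)
    struct pkt_header {
        uint8_t  type;        /* MSG_DATA, MSG_NACK or MSG_HEARTBEAT         */
        uint8_t  flags;       /* reserved                                    */
        uint16_t session;     /* session identifier                          */
        uint32_t seq;         /* per-session sequence number; a gap triggers
                                 a MSG_NACK listing the missing numbers      */
    };
    #pragma pack(pop)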
If you have admin/root privilege on the sending side, you can also try a TCP acceleration driver like SuperTCP.