Are there any protocols/standards on top of TCP optimized for high throughput and low latency? - language-agnostic

Are there any protocols/standards that work over TCP that are optimized for high throughput and low latency?
The only one I can think of is FAST.
At the moment I have devised just a simple text-based protocol delimited by special characters. I'd like to adopt a protocol which is designed for fast transfer and supports perhaps compression and minification of the data that travels over the TCP socket.

Instead of using heavy-weight TCP, we can utilize the connection-oriented/reliable feature of TCP on the top of UDP by any of the following way:
UDP-based Data Transfer Protocol(UDT):
UDT is built on top of User Datagram Protocol (UDP) by adding congestion control and reliability control mechanisms. UDT is an application level, connection oriented, duplex protocol that supports both reliable data streaming and partial reliable messaging.
Acknowledgment:
UDT uses periodic acknowledgments (ACK) to confirm packet delivery, while negative ACKs (loss reports) are used to report packet loss. Periodic ACKs help to reduce control traffic on the reverse path when the data transfer speed is high, because in these situations, the number of ACKs is proportional to time, rather than the number of data packets.
Reliable User Datagram Protocol (RUDP):
It aims to provide a solution where UDP is too primitive because guaranteed-order packet delivery is desirable, but TCP adds too much complexity/overhead.
It extends UDP by adding the following additional features:
Acknowledgment of received packets
Windowing and congestion control
Retransmission of lost packets
Overbuffering (Faster than real-time streaming)
en.wikipedia.org/wiki/Reliable_User_Datagram_Protocol

If layered on top of TCP, you won't get better throughput or latency than the 'barest' TCP connection.
there are other non-TCP high-throughput and/or low-latency connection-oriented protocols, usually layered on top of UDP.
almost the only one i know is UDT, which is optimized for networks where the high bandwidth or long round trip times (RTT) makes typical TCP retransmissions suboptimal. These are called 'extremely long fat networks' (LFN, pronounced 'elefan').

You may want to consider JMS. JMS can run on top of TCP, and you can get reasonable latency with a message broker like ActiveMQ.
It really depends on your target audience though. If your building a game which must run anywhere, you pretty much need to use HTTP or HTTP/Streaming. If you are pushing around market data on a LAN, than something NOT using TCP would probably suite you better. Tibco RV and JGroups both provide reliable low-latency messaging over multicast.

Just as you mentioned FAST - it is intended for market data distribution and is used by leading stock exchanges and is running on the top of UDP multicast.
In general, with current level of networks reliability it always worth putting your protocol on the top of UDP.
Whatever having session sequence number, NACK+server-to-client-heartbeat and binary marshalling should be close to theoretical performance.

If you have admin/root privilege on the sending side, you can also try a TCP acceleration driver like SuperTCP.

Related

Webrtc behavior Nack & FEC

We have WebRTC application with two peers and I experience packet loss of around 5% (checked on webrtc-internals) when call is ongoing. I see Nacks as well.
Wants to know if FEC is being implemented in my setup? I do see some SDP parameters related to FEC as below but not sure whether they are used or not.
How to check if Webrtc is using FEC?
a=rtpmap:124 red/90000
a=rtpmap:123 ulpfec/90000
Also is there any suggestions on how to improve packet loss percentage by tweaking Nacks or FEC etc?
Tried with different bandwidth and resolutions and packet loss is almost same.
Easiest way to determine whether FEC is actually used is to run a packet capture using Wireshark or tcpdump and look for RTP packets where the payload type matches the value in the SDP (123 and 124 in your example). If you see these packets, you’re seeing FEC.
One thing to note, FEC could make packet loss worse in some cases, essentially where you have bursts of back to back packets lost because of congestion. FEC is transmitting additional packets, which allows any one or two packets in a group to be lost and recovered from the additional packets.
Found the root cause for packet loss. It was related to setup on network switches. We are using dedicated leaseline and leaseline expects fixed 100Mbps duplex configuration instead of auto configuration on network switch ports. Due to auto configuration, the link went in to half duplex and hence FEC errors.

RTP fragmentation vs UDP fragmentation

I don't understand why we bother fragmenting at RTP level if UDP (or IP) layer does the fragmentation.
As I understand it, let's say we are on Ethernet link, the MTU is 1500 bytes.
If I have to send, for example, 3880 bytes, fragmenting at IP layer, would results in 3 packets of respectively 1500, 1500, and 940 bytes (IP header is 20 bytes, so the total overhead results in 60 bytes).
If I do it at UDP layer the overhead will be 84 bytes (3x 28 bytes).
At RTP layer it's 120 bytes of overhead.
At H264/NAL packetization layer, it's 3 more bytes (so 123 bytes final) for FU-A mode.
For such a small packet, it makes a final increase of 3.1% for the initial packet size, while at IP layer, it would only waste 1.5% overall.
Is there any valid reason to bother making such a complex packetization rules at RTP layer knowing it'd always be worse than lower layer fragmentation?
Except for the first fragment, fragmented IP traffic does not contain the source or destination port numbers. Instead it glues packets together using sequence IDs. This makes it impossible for stateless intermediate network devices (switches and routers) that need to re-install QoS (because .1p or DSCP flags were cleared by another device or never existed in the first place.) Unless the device has the resources to manage per-session state, it either has to risk rate-limiting/prioritizing fragments from unrelated streams, or not prioritizing any fragments, some of which can be voice/video.
AFAIK RTP packets never IP-fragment unless the network has MTU mismatches in it. Hence each UDP header has source and destination port numbers, so if you can tame your clients to use known port ranges, you can re-establish QoS markings based on this information, and you can pass IP fragments as vanilla traffic and not worry about dropping voice/video data.
RTP is designed with UDP in mind.
Applications typically run RTP on top of UDP to make use of its
multiplexing and checksum services; both protocols contribute parts of
the transport protocol functionality.
However RTP services that are added to raw UDP such as ability to detect packet reordering, losses and timing require that UDP data consists of RTP payload and also service information.
The Internet, like other packet networks, occasionally loses and
reorders packets and delays them by variable amounts of time. To cope
with these impairments, the RTP header contains timing information
and a sequence number that allow the receivers to reconstruct the
timing produced by the source, so that in this example, chunks of
audio are contiguously played out the speaker every 20 ms. This
timing reconstruction is performed separately for each source of RTP
packets in the conference. The sequence number can also be used by
the receiver to estimate how many packets are being lost.
Then RTP is designed to be extensible, common headers and data specific payload:
RTP is a protocol framework that is deliberately not complete. This document specifies those functions expected to be common across all the applications for which RTP would be appropriate. Unlike conventional protocols in which additional functions might be accommodated by making the protocol more general or by adding an option mechanism that would require
parsing, RTP is intended to be tailored through modifications and/or additions to the headers as needed.
All quotes are from RFC 1889 "RTP: A Transport Protocol for Real-Time Applications".
That is, RTP overhead for H.264 stream is not just a waste of bandwidth. RTP headers and H.264 payload formatting allow, at moderate cost, to handle video data streaming in a more reliable way, and in the same time to leverage specification which is well defined and good for different sorts of data.
I'd like to add that a lot of RTP servers/senders go about sending split datagrams inefficiently.
They use a lot of malloc/free in dynamic buffer contexts.
They also use one syscall per part of the message instead of message-vectors.
To add insult to injury they usually do a lot of time calculation / other handling between sending every part of the datagram.
This causes even more syscalls, sometimes even stretching the packet over a long time because they have no upper bound when the packet should be finished, only that it is finished before sending the next batch of packets.
Inefficient behavior like this gets seriously in the way if you want to scale throughput or on a low power embedded CPU. For bw, network and CPU efficiency reasons, it's usually way better to send the entire datagram in one go to the kernel and let it deal with fragmentation instead of userspace trying to figure it out.
Well, after a lot of thinking about this, there is no reason not to use IP based fragmentation up to 64kB (and this will happen if you have a lot of same timestamp's NAL unit you need to aggregate, via STAP-A for example).
The RFC6184 is clear, you can use up to 64kB of NAL this way since each NAL unit's size of exactly 2 bytes (16 bits) is appended before the actual NAL unit, although staying below the MTU is preferred.
What happen if the "single-time" NAL units cumulated size is larger than 64kB ? The RFC6184 does not say, but I guess you'll have to send all your NAL as separate FU-A packets without increasing the timestamp between them (this is where the only reason why the Start/End bit in the FU-A header is useful, since there is no more 1:1 match between the End bit and the RTP's marker bit).
The RFC states:
An aggregation packet can
carry as many aggregation units as necessary; however, the total
amount of data in an aggregation packet obviously MUST fit into an IP
packet, and the size SHOULD be chosen so that the resulting IP packet
is smaller than the MTU size
When a "single NAL per frame" is larger than the MTU (for example, 1460 bytes with Ethernet), it has to be split with a fragmentation unit packetization (for example, FU-A).
However, nothing in the RFC states that the limit should be 1460 bytes. And it makes sense to have larger than that when doing Ethernet only streaming (as computed above)
If you have a NAL unit larger than 64kB, then you must use FU-A to send it since you can not fit this in a single IP datagram.
The RFC states:
This payload type allows fragmenting a NAL unit into several RTP
packets. Doing so on the application layer instead of relying on
lower-layer fragmentation (e.g., by IP) has the following advantages:
o The payload format is capable of transporting NAL units bigger
than 64 kbytes over an IPv4 network that may be present in pre-
recorded video, particularly in High-Definition formats (there is
a limit of the number of slices per picture, which results in a
limit of NAL units per picture, which may result in big NAL
units).
o The fragmentation mechanism allows fragmenting a single NAL unit
and applying generic forward error correction as described in
Section 12.5.
Which I understand as: "If you NAL unit is less than 64kbytes, and you don't care about FEC, then don't use FU-A, but use a single RTP packet for it"
Another case where FU-A are necessary is when receiving a H264 stream with RTP over RTSP (interleaved mode). The "packet" size must fit in 2 bytes (16bits), so you also must fragment larger NAL unit even if send on a reliable stream socket.

Considerable delays between calling URLLoader, and seeing bytes on the wire

Context: An AS3 application using a BOSH solution to asynchronously send and receive messages from a server.
I am endeavouring to find the source of a delay between calls to URLLoader.load(), and observing the HTTP request on the wire with Wireshark. The delay is unpredictable and can be 30s or more. A reliable reproduction has not been possible.
I have been able to eliminate:
Network congestion: there is no TCP retransmission, no part of the journey to the server crosses a network that is the Internet or otherwise likely to suffer congestion.
Exceeding the HTTP transaction limit. At most there are two concurrent HTTP transactions.
Has anyone any wisdom they can share share?

Ordering of the messages

Does ZeroMQ guarantee the order of messages( FIFO ).
Is there an option for persistence.
Is it the best fit for IPC communications.
Does it allow prioritizing the messages.
Does it allow prioritizing the receivers.
Does it allow both synchronous as well asynchronous way of communication?
https://lists.zeromq.org/pipermail/zeromq-dev/2015-January/027748.html
The author said:"Messages carried over TCP or IPC will be delivered in order if they pass through the same network paths. This is guaranteed and it's a TCP
guarantee, nothing to do with ZeroMQ. ZeroMQ does not reorder
messages, ever. However if you pass messages through two or more
paths, and then merge those flows again, you will in effect shuffle
the messages."
Zeromq is best understood as a udp like messaging system. Thus is does not intrinsically guarantee any of that. It DOES guarantee that parts of a single message are received atomically and in order, since ZMQ allows for sending a message consisting of several parts. All communication is always asynchronous by design.
see http://zguide.zeromq.org/ for more advanced patterns.
that being said, all the features requested would by definition make the transmission slower and more complicated. If they are needed you should implement or use one of the available patterns of the guide.

TCP Slow Start, Congestion Avoidance & Determining Bandwidth

Is there a formula someplace which can be used to determine the minimum number of segments / bytes which need to be transfered across a TCP connection to determine it's bandwidth and which takes into account Slow Start and Congestion Avoidance? I'm aware of the pathrate tool, but I want if possible something a bit simpler that I can incorporate in an app to get a descent ballpark figure. One example of usage would be downloading some data from a webserver in order to determine the optimum number of threads for downloading a bunch of small files automatically. This is related to a previous question I posted: TCP, HTTP and the Multi-Threading Sweet Spot
You can fire up scholar.google.com and search for "TCP chirp". However, that requires hires timers, and if you don't write a kernel tcp congestion control algorithm, you'd have to reimplement TCP in userspace. And that by itself will probably not give good results (general purpose OS are not very good at realtime hires timer related stuff, runnning in userspace).
In theory, using TCP chirp you need as few as 4-5 segments (typically, you'd get better resolution with a longer train of segments) to determine the "optimal" bandwidth.
In any case, since you can not know which path is used (ie. satellite link or tv broadcast in the forward direction), you may need a considerable amount of data (10+ MB, perhaps even 1GB) to get a decent measurement over arbitrary paths. (Satellites can have many dozend MB/s bandwidth, but also latencies in the 1000-3000 ms range; and TCP takes a couple round-trip times to open up cwnd (I'd say around 10 RTTs before a measurement should be started)...
I do not think that there is a fixed number of bytes required to be sent to determine the bandwidth. This number can depend on network type and speed.
Bandwidth is a measure of some resource transferred over a time interval. To get real data you need to measure it. Here are some hints how to do that