Definition of Round-Trip Time using ICMP ping messages (ping)

How is the RTT defined when a "simple" ping command is used?
Example (Win7):
ping -l 600 www.google.de
My understanding is:
An ICMP echo request with a 600-byte payload is sent to Google (the request). Google copies that payload and sends it back to the sender (the reply).
The RTT is the elapsed time for the whole exchange, from sending the request until the corresponding 600-byte reply is received.
Is that right?
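For reference, the same exchange can be timed from code. Here is a minimal Java sketch (note: InetAddress.isReachable only sends an ICMP echo request when it can obtain the required privileges; otherwise it falls back to a TCP probe on port 7, so treat the result as a rough approximation of ping's RTT, not a replacement for it):
import java.net.InetAddress;

public class RttProbe {
    public static void main(String[] args) throws Exception {
        InetAddress host = InetAddress.getByName("www.google.de");
        long start = System.nanoTime();
        boolean reachable = host.isReachable(3000);            // timeout in ms
        long rttMs = (System.nanoTime() - start) / 1_000_000;
        // The RTT is the time from sending the request until the reply arrives.
        System.out.println("reachable=" + reachable + ", approx RTT = " + rttMs + " ms");
    }
}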

Latency typically has two main causes:
1) Distance between the two nodes. This plays a major role in the resulting latency. For example, consider a scenario where node A and node B need to communicate, sending ICMP messages to each other.
a) The fewer the hops, the lower the latency; more hops mean more latency.
Solution: you can select an alternate path for the communication, ideally one covering a shorter distance.
2) How busy the network is. Whenever a packet is sent from one network to another, routers process it, which takes some milliseconds each time; all of that processing time, in both directions, adds up in the measured latency.
a) It depends on how busy the processing device is: if it is lightly loaded, packets are processed and forwarded quickly; if it is busy, they take longer.
Solution: one possible solution is QoS, which lets you prioritize traffic (not the ICMP traffic itself, of course, but some other kind of traffic).

Related

Massive ICMP ping fair use policy

I made a (Java) tool capable of issuing tens of ICMP pings per second towards individual hosts. For example, for 1000 different hosts it takes about one minute to send the pings, wait, and collect any responses from the individual hosts.
The purpose of this tool would be to monitor periodically the functioning of network connectivity for a collection of remote hosts.
Am I free to push this tool to its limits with an arbitrarily high ping rate and total number of hosts, or should I restrict it to avoid being banned or blocked by anyone? How costly or annoying are ICMP pings on networks?
A simple PING packet is 74 bytes long, including the Ethernet frame header, the IP + ICMP headers, and the minimum 32 bytes of ICMP payload - so even 1000 of them is not that much data.
But in my opinion you should not use PING too heavily. Network admins can detect such abnormal network behaviour and try to contact you or block your IP. An IDS or a router can also cut you off because of its policy.
The purpose of ICMP PING packets is to help network admins to diagnose network infrastructure problems. Typical use is to send a small number of packets to a target machine, like:
$ ping stackoverflow.com
Pinging stackoverflow.com [151.101.129.69] with 32 bytes of data:
Reply from 151.101.129.69: bytes=32 time=72ms TTL=57
Reply from 151.101.129.69: bytes=32 time=73ms TTL=57
Reply from 151.101.129.69: bytes=32 time=73ms TTL=57
Reply from 151.101.129.69: bytes=32 time=72ms TTL=57
Ping statistics for 151.101.129.69:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 72ms, Maximum = 73ms, Average = 72ms
As you can see, the system ping command sends four packets - enough to diagnose a problem. Of course, you can change the size and number of the packets.
In my opinion, any other usage of ICMP PING (a larger number of packets, a high sending rate, or very large packets) is a sign of abnormal usage. It is very often related to a virus/trojan/worm infection, aggressive port scanning by hackers, or a DDoS attack.
You want to send ~1000 packets - IMHO that's way too many. You should at least make this number configurable.
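If you do want to keep such a tool polite, here is a minimal Java sketch of a rate-limited probe loop (the host names and the roughly-10-probes-per-second limit are just placeholders; InetAddress.isReachable falls back to a TCP probe on port 7 when ICMP privileges are not available):
import java.net.InetAddress;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class PoliteProber {
    private static final long DELAY_MS = 100;   // at most ~10 probes per second

    public static void main(String[] args) throws Exception {
        List<String> hosts = List.of("host1.example.com", "host2.example.com"); // placeholders
        for (String name : hosts) {
            long start = System.nanoTime();
            boolean up = InetAddress.getByName(name).isReachable(2000);
            long rttMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(name + " up=" + up + " (" + rttMs + " ms)");
            TimeUnit.MILLISECONDS.sleep(DELAY_MS);              // simple rate limit
        }
    }
}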

How to resolve tcpdump dropped packets?

I am using tcpdump to capture network packets, and I run into an issue where it starts dropping packets. I ran an application which exchanges packets rapidly over the network, resulting in high network bandwidth usage.
>> tcpdump -i eno1 -s 64 -B 919400
126716 packets captured
2821976 packets received by filter
167770 packets dropped by kernel
Since I am only interested in the protocol-related part of the TCP packets, I want to collect them without the data/payload. I hope this strategy also helps in capturing more packets before any are dropped. It appears that I can only increase the buffer size (-B argument) up to a certain limit, and even with a higher limit I am dropping more packets than I capture.
Can you help me understand the messages above? My questions are:
What are "packets captured"?
What are "packets received by filter"?
What are "packets dropped by kernel"?
How can I capture all packets at high bandwidth without dropping any? My test application runs for 3 minutes and exchanges packets at a very high rate; I am only interested in the protocol-related information, not in the actual data/payload being sent.
From Guy Harris himself:
the "packets captured" number is a number that's incremented every time tcpdump sees a packet, so it counts packets that tcpdump reads from libpcap and thus that libpcap reads from BPF and supplies to tcpdump.
The "packets received by filter" number is the "ps_recv" number from a call to pcap_stats(); with BPF, that's the bs_recv number from the BIOCGSTATS ioctl. That count includes all packets that were handed to BPF; those packets might still be in a buffer that hasn't yet been read by libpcap (and thus not handed to tcpdump), or might be in a buffer that's been read by libpcap but not yet handed to tcpdump, so it can count packets that aren't reported as "captured".
And from the tcpdump man page:
packets ``dropped by kernel'' (this is the number of packets that were dropped, due to a lack of buffer space, by the packet capture mechanism in the OS on which tcpdump is running, if the OS reports that information to applications; if not, it will be reported as 0).
To attempt to improve capture performance, here are a few things to try:
Don't capture in promiscuous mode if you don't need to. That will cut down on the amount of traffic that the kernel has to process. Do this by using the -p option.
Since you're only interested in TCP traffic, apply a capture expression that limits the traffic to TCP only. Do this by appending "tcp" to your command.
Try writing the packets to a file (or files, to limit size) rather than displaying them on the screen. Do this with the -w file option, or look into the -C file_size and -G rotate_seconds options if you want to limit file sizes (a combined invocation is sketched after this list).
You could try to improve tcpdump's scheduling priority via nice.
From Wireshark's Performance wiki page:
stop other programs running on that machine, to remove system load
buy a bigger, faster machine :)
increase the buffer size (which you're already doing)
set a snap length (which you're already doing)
write capture files to a RAM disk
Try using PF_RING.
You could also try using dumpcap instead of tcpdump, although I would be surprised if the performance was drastically different.
You could try capturing with an external, dedicated device using a TAP or Switch+SPAN port. See Wireshark's Ethernet Capture Setup wiki page for ideas.
Another promising possibility: Capturing Packets in Linux at a Speed of Millions of Packets per Second without Using Third Party Libraries.
See also Andrew Brown's Sharkfest '14 Maximizing Packet Capture Performance document for still more ideas.
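Putting the tcpdump-specific suggestions above together, the invocation could look roughly like this (illustrative only; adjust the interface, snap length, buffer size, and file name to your setup):
tcpdump -p -i eno1 -s 64 -B 919400 -w capture.pcap tcp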
Good luck!
I would try actually lowering the value of your -B option.
The unit is 1 KiB (1024 bytes), thus the buffer size you specified (919400) is almost 1 gigabyte.
I suppose you would get better results by using a value closer to your CPU cache size, e.g. -B 16384.

RTP fragmentation vs UDP fragmentation

I don't understand why we bother fragmenting at the RTP level if the UDP (or IP) layer does the fragmentation anyway.
As I understand it, let's say we are on an Ethernet link where the MTU is 1500 bytes.
If I have to send, for example, 3880 bytes, fragmenting at the IP layer would result in 3 packets of 1500, 1500, and 940 bytes respectively (the IP header is 20 bytes, so the total overhead is 60 bytes).
If I do it at the UDP layer, the overhead is 84 bytes (3 x 28 bytes).
At the RTP layer, it's 120 bytes of overhead.
At the H264/NAL packetization layer, it's 3 more bytes (so 123 bytes in total) for FU-A mode.
For such a small amount of data, that is a final increase of 3.1% over the initial size, while at the IP layer it would only cost 1.5% overall.
Is there any valid reason to bother with such complex packetization rules at the RTP layer, knowing it will always be worse than lower-layer fragmentation?
Except for the first fragment, fragmented IP traffic does not contain the source or destination port numbers; instead it glues the packets together using fragment IDs. This makes it impossible for stateless intermediate network devices (switches and routers) that need to re-apply QoS (because the 802.1p or DSCP markings were cleared by another device, or never existed in the first place) to classify such fragments. Unless the device has the resources to manage per-session state, it either has to risk rate-limiting/prioritizing fragments from unrelated streams, or it cannot prioritize any fragments, some of which may carry voice/video.
AFAIK RTP packets never IP-fragment unless the network has MTU mismatches in it. Hence each UDP header has source and destination port numbers, so if you can tame your clients to use known port ranges, you can re-establish QoS markings based on this information, and you can pass IP fragments as vanilla traffic and not worry about dropping voice/video data.
RTP is designed with UDP in mind.
Applications typically run RTP on top of UDP to make use of its
multiplexing and checksum services; both protocols contribute parts of
the transport protocol functionality.
However, the services RTP adds on top of raw UDP, such as the ability to detect packet reordering and loss and to recover timing, require that the UDP payload carry not only the media data but also this service information.
The Internet, like other packet networks, occasionally loses and
reorders packets and delays them by variable amounts of time. To cope
with these impairments, the RTP header contains timing information
and a sequence number that allow the receivers to reconstruct the
timing produced by the source, so that in this example, chunks of
audio are contiguously played out the speaker every 20 ms. This
timing reconstruction is performed separately for each source of RTP
packets in the conference. The sequence number can also be used by
the receiver to estimate how many packets are being lost.
RTP is also designed to be extensible, with common headers and data-specific payload formats:
RTP is a protocol framework that is deliberately not complete. This document specifies those functions expected to be common across all the applications for which RTP would be appropriate. Unlike conventional protocols in which additional functions might be accommodated by making the protocol more general or by adding an option mechanism that would require
parsing, RTP is intended to be tailored through modifications and/or additions to the headers as needed.
All quotes are from RFC 1889 "RTP: A Transport Protocol for Real-Time Applications".
That is, the RTP overhead for an H.264 stream is not just wasted bandwidth. The RTP headers and the H.264 payload formatting make it possible, at moderate cost, to handle streamed video data more reliably, and at the same time to build on a specification that is well defined and suitable for many sorts of data.
I'd like to add that a lot of RTP servers/senders go about sending split datagrams inefficiently.
They use a lot of malloc/free in dynamic buffer contexts.
They also use one syscall per part of the message instead of message vectors (scatter/gather I/O).
To add insult to injury, they usually do a lot of time calculation and other handling between sending the parts of a datagram.
This causes even more syscalls, and sometimes stretches the packet over a long time, because they have no upper bound on when the packet should be finished, only that it must be finished before the next batch of packets is sent.
Inefficient behavior like this seriously gets in the way if you want to scale throughput or run on a low-power embedded CPU. For bandwidth, network, and CPU efficiency, it is usually much better to hand the entire datagram to the kernel in one go and let it deal with fragmentation, instead of userspace trying to work it out.
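To illustrate the message-vector point: in Java, for example, a connected DatagramChannel can gather a pre-built header and the payload into one datagram with a single write call (a minimal sketch; the receiver address and buffer sizes are made up):
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;

public class GatheringSend {
    public static void main(String[] args) throws Exception {
        ByteBuffer rtpHeader = ByteBuffer.wrap(new byte[12]);   // pre-built RTP header would go here
        ByteBuffer payload   = ByteBuffer.wrap(new byte[1400]); // media payload would go here

        try (DatagramChannel ch = DatagramChannel.open()) {
            ch.connect(new InetSocketAddress("192.0.2.10", 5004)); // hypothetical receiver
            // One gathering write = one syscall = one datagram,
            // with no copying of header and payload into a temporary buffer.
            ch.write(new ByteBuffer[] { rtpHeader, payload });
        }
    }
}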
Well, after a lot of thinking about this: there is no reason not to use IP-based fragmentation up to 64 kB (and this will happen if you have many NAL units with the same timestamp that you need to aggregate, via STAP-A for example).
RFC 6184 is clear: you can aggregate up to 64 kB of NAL units this way, since a 2-byte (16-bit) size field is prepended to each NAL unit, although staying below the MTU is preferred.
What happens if the cumulative size of the same-timestamp NAL units is larger than 64 kB? RFC 6184 does not say, but I guess you have to send all your NAL units as separate FU-A packets without increasing the timestamp between them (this is the only reason the Start/End bits in the FU-A header are useful, since there is no longer a 1:1 match between the End bit and RTP's marker bit).
The RFC states:
An aggregation packet can
carry as many aggregation units as necessary; however, the total
amount of data in an aggregation packet obviously MUST fit into an IP
packet, and the size SHOULD be chosen so that the resulting IP packet
is smaller than the MTU size
When a "single NAL per frame" is larger than the MTU (for example, 1460 bytes with Ethernet), it has to be split with a fragmentation unit packetization (for example, FU-A).
However, nothing in the RFC states that the limit must be 1460 bytes, and it can make sense to use a larger size when streaming over Ethernet only (as computed above).
If you have a NAL unit larger than 64 kB, then you must use FU-A to send it, since you cannot fit it in a single IP datagram.
The RFC states:
This payload type allows fragmenting a NAL unit into several RTP
packets. Doing so on the application layer instead of relying on
lower-layer fragmentation (e.g., by IP) has the following advantages:
o The payload format is capable of transporting NAL units bigger
than 64 kbytes over an IPv4 network that may be present in pre-
recorded video, particularly in High-Definition formats (there is
a limit of the number of slices per picture, which results in a
limit of NAL units per picture, which may result in big NAL
units).
o The fragmentation mechanism allows fragmenting a single NAL unit
and applying generic forward error correction as described in
Section 12.5.
Which I understand as: "If your NAL unit is smaller than 64 kB, and you don't care about FEC, then don't use FU-A; use a single RTP packet for it."
Another case where FU-A is necessary is when receiving an H264 stream with RTP over RTSP (interleaved mode). The "packet" size must fit in 2 bytes (16 bits), so you must also fragment larger NAL units even though they are sent over a reliable stream socket.
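To make the FU-A mechanism concrete, here is a simplified Java sketch of splitting one oversized NAL unit into FU-A payloads. It only shows the FU indicator / FU header construction from RFC 6184; the RTP header, timestamps and marker-bit handling are left out, and the 1400-byte limit is just an example:
import java.util.ArrayList;
import java.util.List;

public class FuAFragmenter {
    // Split one NAL unit (including its 1-byte header) into FU-A payloads
    // no larger than maxPayload bytes each.
    static List<byte[]> fragment(byte[] nal, int maxPayload) {
        byte nalHeader = nal[0];
        byte fuIndicator = (byte) ((nalHeader & 0xE0) | 28);   // F + NRI copied, type = 28 (FU-A)
        byte nalType = (byte) (nalHeader & 0x1F);

        List<byte[]> fragments = new ArrayList<>();
        int offset = 1;                                        // the NAL header byte itself is not sent as-is
        while (offset < nal.length) {
            int chunk = Math.min(maxPayload - 2, nal.length - offset);
            byte fuHeader = nalType;
            if (offset == 1) fuHeader |= 0x80;                 // S (start) bit on the first fragment
            if (offset + chunk == nal.length) fuHeader |= 0x40; // E (end) bit on the last fragment

            byte[] fu = new byte[2 + chunk];
            fu[0] = fuIndicator;
            fu[1] = fuHeader;
            System.arraycopy(nal, offset, fu, 2, chunk);
            fragments.add(fu);
            offset += chunk;
        }
        return fragments;
    }

    public static void main(String[] args) {
        byte[] nal = new byte[3880];        // dummy NAL unit, roughly the size used in the question above
        nal[0] = 0x65;                      // e.g. an IDR slice NAL header byte
        List<byte[]> fus = fragment(nal, 1400);
        System.out.println(fus.size() + " FU-A payloads, first size = " + fus.get(0).length);
    }
}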

TCP Slow Start, Congestion Avoidance & Determining Bandwidth

Is there a formula somewhere which can be used to determine the minimum number of segments/bytes that need to be transferred across a TCP connection to determine its bandwidth, and which takes Slow Start and Congestion Avoidance into account? I'm aware of the pathrate tool, but I want, if possible, something a bit simpler that I can incorporate into an app to get a decent ballpark figure. One example of usage would be downloading some data from a web server in order to determine the optimum number of threads for downloading a bunch of small files automatically. This is related to a previous question I posted: TCP, HTTP and the Multi-Threading Sweet Spot
You can fire up scholar.google.com and search for "TCP chirp". However, that requires high-resolution timers, and unless you write a kernel TCP congestion-control algorithm, you would have to reimplement TCP in userspace, which by itself will probably not give good results (general-purpose OSes are not very good at real-time, high-resolution timing in userspace).
In theory, using TCP chirp you need as few as 4-5 segments (though typically you get better resolution with a longer train of segments) to determine the "optimal" bandwidth.
In any case, since you cannot know which path is used (e.g. a satellite link or TV broadcast in the forward direction), you may need a considerable amount of data (10+ MB, perhaps even 1 GB) to get a decent measurement over arbitrary paths. (Satellites can have many dozens of MB/s of bandwidth, but also latencies in the 1000-3000 ms range, and TCP takes a couple of round-trip times to open up cwnd; I'd say around 10 RTTs before a measurement should be started.)
I do not think that there is a fixed number of bytes required to be sent to determine the bandwidth. This number can depend on network type and speed.
Bandwidth is a measure of the amount of data transferred over a time interval. To get real figures you need to measure it. Here are some hints on how to do that.
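For a rough, practical measurement, one option is simply to time a sufficiently large download and discard the first part of it so that slow start has (mostly) finished before the clock starts. A minimal Java sketch (the URL is a placeholder, and the 1 MB warm-up is an arbitrary guess that would need tuning for the paths you care about):
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class BandwidthEstimate {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/largefile.bin");          // placeholder URL
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[64 * 1024];
            final long WARMUP = 1_000_000;   // skip ~1 MB so slow start is (mostly) over
            long skipped = 0, measured = 0, start = 0;
            int n;
            while ((n = in.read(buf)) > 0) {
                if (skipped < WARMUP) {
                    skipped += n;
                    if (skipped >= WARMUP) start = System.nanoTime();    // start timing here
                } else {
                    measured += n;
                }
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("~%.2f Mbit/s over %d bytes%n",
                    measured * 8 / seconds / 1e6, measured);
        }
    }
}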

Are there any protocols/standards on top of TCP optimized for high throughput and low latency?

Are there any protocols/standards that work over TCP that are optimized for high throughput and low latency?
The only one I can think of is FAST.
At the moment I have devised just a simple text-based protocol delimited by special characters. I'd like to adopt a protocol designed for fast transfer, which perhaps supports compression and minification of the data that travels over the TCP socket.
Instead of using heavyweight TCP, we can get TCP's connection-oriented/reliable behaviour on top of UDP in either of the following ways:
UDP-based Data Transfer Protocol (UDT):
UDT is built on top of User Datagram Protocol (UDP) by adding congestion control and reliability control mechanisms. UDT is an application level, connection oriented, duplex protocol that supports both reliable data streaming and partial reliable messaging.
Acknowledgment:
UDT uses periodic acknowledgments (ACK) to confirm packet delivery, while negative ACKs (loss reports) are used to report packet loss. Periodic ACKs help to reduce control traffic on the reverse path when the data transfer speed is high, because in these situations, the number of ACKs is proportional to time, rather than the number of data packets.
Reliable User Datagram Protocol (RUDP):
It aims to provide a solution where UDP is too primitive because guaranteed-order packet delivery is desirable, but TCP adds too much complexity/overhead.
It extends UDP by adding the following additional features:
Acknowledgment of received packets
Windowing and congestion control
Retransmission of lost packets
Overbuffering (Faster than real-time streaming)
en.wikipedia.org/wiki/Reliable_User_Datagram_Protocol
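The common idea behind both UDT and RUDP is to put a sequence number (plus ACK/NACK handling) on top of plain UDP datagrams. A bare-bones Java sketch of the sending side (the address, port and payload size are made up; ACKs, retransmission, windowing and congestion control are exactly the parts a real protocol adds):
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.ByteBuffer;

public class SeqUdpSender {
    public static void main(String[] args) throws Exception {
        InetAddress peer = InetAddress.getByName("192.0.2.20");   // hypothetical receiver
        int port = 9000;                                          // hypothetical port
        try (DatagramSocket socket = new DatagramSocket()) {
            for (int seq = 0; seq < 100; seq++) {
                // Binary marshalling: a 4-byte sequence number followed by the payload.
                ByteBuffer msg = ByteBuffer.allocate(4 + 1024);
                msg.putInt(seq);
                msg.put(new byte[1024]);                          // application payload goes here
                socket.send(new DatagramPacket(msg.array(), msg.position(), peer, port));
            }
            // A real protocol (UDT, RUDP) adds ACK/NACK handling, retransmission,
            // windowing/congestion control and heartbeats on top of this.
        }
    }
}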
If layered on top of TCP, you won't get better throughput or latency than the 'barest' TCP connection.
There are other non-TCP high-throughput and/or low-latency connection-oriented protocols, usually layered on top of UDP.
Almost the only one I know is UDT, which is optimized for networks where high bandwidth or long round-trip times (RTT) make typical TCP retransmissions suboptimal. These are called 'long fat networks' (LFN, pronounced 'elefan').
You may want to consider JMS. JMS can run on top of TCP, and you can get reasonable latency with a message broker like ActiveMQ.
It really depends on your target audience, though. If you're building a game which must run anywhere, you pretty much need to use HTTP or HTTP/Streaming. If you are pushing around market data on a LAN, then something NOT using TCP would probably suit you better. Tibco RV and JGroups both provide reliable low-latency messaging over multicast.
As you mentioned, FAST is intended for market data distribution, is used by leading stock exchanges, and runs on top of UDP multicast.
In general, with the current level of network reliability, it is always worth putting your protocol on top of UDP.
Anything with session sequence numbers, NACKs plus a server-to-client heartbeat, and binary marshalling should be close to the theoretical performance.
If you have admin/root privilege on the sending side, you can also try a TCP acceleration driver like SuperTCP.