TCP Slow Start, Congestion Avoidance & Determining Bandwidth - language-agnostic

Is there a formula somewhere that can be used to determine the minimum number of segments / bytes that need to be transferred across a TCP connection to determine its bandwidth, one which takes into account Slow Start and Congestion Avoidance? I'm aware of the pathrate tool, but if possible I want something a bit simpler that I can incorporate into an app to get a decent ballpark figure. One example of usage would be downloading some data from a webserver in order to determine the optimum number of threads for downloading a bunch of small files automatically. This is related to a previous question I posted: TCP, HTTP and the Multi-Threading Sweet Spot

You can fire up scholar.google.com and search for "TCP chirp". However, that requires high-resolution timers, and unless you write a kernel TCP congestion control algorithm you'd have to reimplement TCP in userspace, which by itself will probably not give good results (general-purpose OSes are not very good at real-time, high-resolution timing from userspace).
In theory, using TCP chirp you need as few as 4-5 segments (typically, you'd get better resolution with a longer train of segments) to determine the "optimal" bandwidth.
In any case, since you cannot know which path is used (i.e. a satellite link or TV broadcast in the forward direction), you may need a considerable amount of data (10+ MB, perhaps even 1 GB) to get a decent measurement over arbitrary paths. Satellites can have many dozens of MB/s of bandwidth, but also latencies in the 1000-3000 ms range, and TCP takes a couple of round-trip times to open up cwnd (I'd say around 10 RTTs before a measurement should be started).
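As a rough illustration of that brute-force approach, here is a minimal sketch in Go: download a reasonably large object, discard the beginning of the transfer so slow start has (hopefully) finished, and time the rest. The URL and the warm-up/measurement sizes are placeholders, not recommendations.

```go
// Rough ballpark bandwidth estimate: discard the beginning of the
// transfer (slow start / cwnd growth), then time the remainder.
// The URL and byte counts below are illustrative placeholders.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func estimateBandwidth(url string, warmupBytes, measureBytes int64) (float64, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	// Discard the first warmupBytes so the measurement starts after
	// the congestion window has (hopefully) opened up.
	if _, err := io.CopyN(io.Discard, resp.Body, warmupBytes); err != nil {
		return 0, err
	}

	start := time.Now()
	n, err := io.CopyN(io.Discard, resp.Body, measureBytes)
	elapsed := time.Since(start).Seconds()
	if err != nil && err != io.EOF {
		return 0, err
	}
	return float64(n) / elapsed, nil // bytes per second
}

func main() {
	// Hypothetical test object; 1 MiB warm-up, 10 MiB measured.
	bps, err := estimateBandwidth("http://example.com/bigfile.bin", 1<<20, 10<<20)
	if err != nil {
		fmt.Println("measurement failed:", err)
		return
	}
	fmt.Printf("~%.1f MB/s\n", bps/1e6)
}
```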

I do not think there is a fixed number of bytes that must be sent to determine the bandwidth; the number depends on the network type and speed.
Bandwidth is a measure of some resource transferred over a time interval. To get real data you need to measure it. Here are some hints on how to do that.

Related

How to correctly calculate RAM requirement for a bucket in Couchbase

We have a bucket of about 34 million items in a Couchbase cluster setup of 6 AWS nodes. The bucket has been allocated 32.1GB of RAM (5482MB per node) and is currently using 29.1GB. If I use the formula provided in the Couchbase documentation (http://docs.couchbase.com/admin/admin/Concepts/bp-sizingGuidelines.html) it should use approx. 8.94GB of RAM.
Am I calculating it incorrectly? Below is a link to a Google spreadsheet with all the details.
https://docs.google.com/spreadsheets/d/1b9XQn030TBCurUjv3bkhiHJ_aahepaBmFg_lJQj-EzQ/edit?usp=sharing
Assuming that you indeed have a working set of 0.5%, which as Kirk pointed out in his comment, is odd but not impossible, then you are calculating the result of the memory sizing formula correctly. However, it's important to understand that the formula is not a hard and fast rule that fits all situations. Rather, it's a general guideline and serves as a good starting point for you to go and begin your performance tests. Also, keep in mind that the RAM sizing isn't the only consideration for deciding on cluster size, because you also have to consider data safety, total disk write throughput, network bandwidth, CPU, how much a single node failure affects the rest of the cluster, and more.
Using the result of the RAM sizing formula as a starting point, you should now actually test whether your working assumptions were correct. That means putting real (or close to representative) load on the bucket and seeing whether the percentage of cache misses is low enough and the operation latency is within your acceptable limits. There is no general rule for this; what's acceptable to some applications might be too slow for others.
Just as an example, if you see that under load your cache miss ratio is 5% and, while the average read latency is 3 ms, the top 1% latency is 100 ms, then you have to consider whether having one out of every 100 reads take that much longer is acceptable in your application. If it is, great; if not, you need to start increasing the RAM size until it matches your actual working set. Similarly, you should keep an eye on disk throughput, CPU usage, etc.
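For what it's worth, the sizing formula in the linked documentation boils down to a handful of multiplications. The sketch below (in Go) shows its general shape; the constants (56 bytes of metadata per document, 25% headroom, 85% high-water mark) and the example inputs are illustrative assumptions rather than authoritative values, so check them against the version of the docs you are using.

```go
// Rough sketch of the general shape of the Couchbase RAM sizing
// formula from the linked guidelines. Constants and inputs here are
// illustrative, not authoritative.
package main

import "fmt"

func clusterRAMQuotaGB(docs, keySize, valueSize float64, replicas int, workingSet float64) float64 {
	const (
		metadataPerDoc = 56.0 // bytes per document (version dependent)
		headroom       = 0.25 // headroom suggested by the docs for SSD
		highWaterMark  = 0.85
	)
	copies := float64(1 + replicas)
	totalMetadata := docs * (metadataPerDoc + keySize) * copies
	workingSetBytes := docs * valueSize * copies * workingSet
	quotaBytes := (totalMetadata + workingSetBytes) * (1 + headroom) / highWaterMark
	return quotaBytes / (1 << 30)
}

func main() {
	// Hypothetical inputs: 34M docs, 40-byte keys, 1 KB values,
	// 1 replica, 0.5% working set.
	fmt.Printf("%.2f GB\n", clusterRAMQuotaGB(34e6, 40, 1024, 1, 0.005))
}
```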

RTP fragmentation vs UDP fragmentation

I don't understand why we bother fragmenting at RTP level if UDP (or IP) layer does the fragmentation.
As I understand it, let's say we are on an Ethernet link with an MTU of 1500 bytes.
If I have to send, for example, 3880 bytes, fragmenting at the IP layer would result in 3 packets of respectively 1500, 1500, and 940 bytes (the IP header is 20 bytes, so the total overhead comes to 60 bytes).
If I do it at UDP layer the overhead will be 84 bytes (3x 28 bytes).
At RTP layer it's 120 bytes of overhead.
At H264/NAL packetization layer, it's 3 more bytes (so 123 bytes final) for FU-A mode.
For such a small packet, that is a final overhead of about 3.1% of the initial packet size, while at the IP layer it would only be about 1.5% overall.
Is there any valid reason to bother with such complex packetization rules at the RTP layer, knowing it will always be worse than lower-layer fragmentation?
Except for the first fragment, fragmented IP traffic does not contain the source or destination port numbers. Instead it glues fragments together using the IP identification field. This makes it impossible for stateless intermediate network devices (switches and routers) to classify the traffic when they need to re-install QoS markings (because .1p or DSCP flags were cleared by another device or never existed in the first place). Unless the device has the resources to manage per-session state, it either has to risk rate-limiting/prioritizing fragments from unrelated streams, or not prioritize any fragments, some of which can be voice/video.
AFAIK RTP packets never IP-fragment unless the network has MTU mismatches in it. Hence each UDP header has source and destination port numbers, so if you can tame your clients to use known port ranges, you can re-establish QoS markings based on this information, and you can pass IP fragments as vanilla traffic and not worry about dropping voice/video data.
RTP is designed with UDP in mind.
Applications typically run RTP on top of UDP to make use of its
multiplexing and checksum services; both protocols contribute parts of
the transport protocol functionality.
However, the services that RTP adds on top of raw UDP, such as the ability to detect packet reordering and losses and to recover timing, require that the UDP payload carry not only the media data but also this service information.
The Internet, like other packet networks, occasionally loses and
reorders packets and delays them by variable amounts of time. To cope
with these impairments, the RTP header contains timing information
and a sequence number that allow the receivers to reconstruct the
timing produced by the source, so that in this example, chunks of
audio are contiguously played out the speaker every 20 ms. This
timing reconstruction is performed separately for each source of RTP
packets in the conference. The sequence number can also be used by
the receiver to estimate how many packets are being lost.
RTP is also designed to be extensible, with common headers and payload-specific data:
RTP is a protocol framework that is deliberately not complete. This document specifies those functions expected to be common across all the applications for which RTP would be appropriate. Unlike conventional protocols in which additional functions might be accommodated by making the protocol more general or by adding an option mechanism that would require
parsing, RTP is intended to be tailored through modifications and/or additions to the headers as needed.
All quotes are from RFC 1889 "RTP: A Transport Protocol for Real-Time Applications".
That is, the RTP overhead for an H.264 stream is not just a waste of bandwidth. The RTP headers and the H.264 payload formatting make it possible, at moderate cost, to handle video streaming in a more reliable way, and at the same time to leverage a specification that is well defined and suitable for different sorts of data.
I'd like to add that a lot of RTP servers/senders go about sending split datagrams inefficiently.
They use a lot of malloc/free in dynamic buffer contexts.
They also use one syscall per part of the message instead of message-vectors.
To add insult to injury they usually do a lot of time calculation / other handling between sending every part of the datagram.
This causes even more syscalls, sometimes even stretching the packet over a long time because they have no upper bound when the packet should be finished, only that it is finished before sending the next batch of packets.
Inefficient behavior like this seriously gets in the way if you want to scale throughput or run on a low-power embedded CPU. For bandwidth, network and CPU efficiency reasons, it's usually far better to send the entire datagram in one go to the kernel and let it deal with fragmentation, instead of userspace trying to figure it out.
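To make that concrete, a minimal sketch in Go: build the whole datagram in one buffer and hand it to the kernel with a single write. The address and the header/payload contents are placeholders.

```go
// Sketch: build the complete RTP datagram in userspace and send it
// with one syscall, letting the kernel fragment it if it has to.
// Address and buffer contents are placeholders.
package main

import (
	"log"
	"net"
)

func main() {
	conn, err := net.Dial("udp", "192.0.2.10:5004") // placeholder receiver
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	header := make([]byte, 12)    // RTP fixed header (fields omitted here)
	payload := make([]byte, 3880) // media payload, e.g. a large NAL unit

	// One write for the whole datagram instead of one syscall per part.
	packet := append(append(make([]byte, 0, len(header)+len(payload)), header...), payload...)
	if _, err := conn.Write(packet); err != nil {
		log.Fatal(err)
	}
}
```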
Well, after a lot of thinking about this, there is no reason not to use IP-based fragmentation up to 64 kB (and this will happen if you have a lot of NAL units with the same timestamp that you need to aggregate, via STAP-A for example).
RFC 6184 is clear: you can aggregate up to 64 kB of NAL units this way, since each NAL unit is preceded by its size encoded in exactly 2 bytes (16 bits), although staying below the MTU is preferred.
What happens if the cumulative size of the "single-time" NAL units is larger than 64 kB? RFC 6184 does not say, but I guess you'll have to send all your NAL units as separate FU-A packets without increasing the timestamp between them (this is the one case where the Start/End bits in the FU-A header are really useful, since there is no longer a 1:1 match between the End bit and the RTP marker bit).
The RFC states:
An aggregation packet can
carry as many aggregation units as necessary; however, the total
amount of data in an aggregation packet obviously MUST fit into an IP
packet, and the size SHOULD be chosen so that the resulting IP packet
is smaller than the MTU size
When a "single NAL per frame" is larger than the MTU (for example, 1460 bytes with Ethernet), it has to be split with a fragmentation unit packetization (for example, FU-A).
However, nothing in the RFC states that the limit should be 1460 bytes, and it can make sense to use packets larger than that when doing Ethernet-only streaming (as computed above).
If you have a NAL unit larger than 64 kB, then you must use FU-A to send it, since you cannot fit it in a single IP datagram.
The RFC states:
This payload type allows fragmenting a NAL unit into several RTP
packets. Doing so on the application layer instead of relying on
lower-layer fragmentation (e.g., by IP) has the following advantages:
o The payload format is capable of transporting NAL units bigger
than 64 kbytes over an IPv4 network that may be present in pre-
recorded video, particularly in High-Definition formats (there is
a limit of the number of slices per picture, which results in a
limit of NAL units per picture, which may result in big NAL
units).
o The fragmentation mechanism allows fragmenting a single NAL unit
and applying generic forward error correction as described in
Section 12.5.
Which I understand as: "If your NAL unit is less than 64 kbytes, and you don't care about FEC, then don't use FU-A, but use a single RTP packet for it."
Another case where FU-A is necessary is when receiving an H264 stream with RTP over RTSP (interleaved mode). The "packet" size must fit in 2 bytes (16 bits), so you must also fragment larger NAL units even when sending over a reliable stream socket.
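For reference, a minimal sketch (in Go) of the FU-A fragmentation discussed above: the NAL header byte is replaced by an FU indicator plus an FU header, and the remaining bytes are split across fragments. The maxPayload value and the sample NAL unit are placeholders.

```go
// Minimal sketch of FU-A fragmentation as described in RFC 6184.
// maxPayload is whatever fits in your RTP packet after the header.
package main

import "fmt"

func fragmentFUA(nal []byte, maxPayload int) [][]byte {
	if len(nal) <= maxPayload {
		return [][]byte{nal} // small enough: single NAL unit packet
	}
	nalHeader := nal[0]
	fuIndicator := (nalHeader & 0xE0) | 28 // keep F/NRI, type = 28 (FU-A)
	fuType := nalHeader & 0x1F
	rest := nal[1:] // payload excludes the original NAL header byte

	chunk := maxPayload - 2 // room for FU indicator + FU header
	var out [][]byte
	for off := 0; off < len(rest); off += chunk {
		end := off + chunk
		if end > len(rest) {
			end = len(rest)
		}
		fuHeader := fuType
		if off == 0 {
			fuHeader |= 0x80 // S (start) bit
		}
		if end == len(rest) {
			fuHeader |= 0x40 // E (end) bit
		}
		frag := append([]byte{fuIndicator, fuHeader}, rest[off:end]...)
		out = append(out, frag)
	}
	return out
}

func main() {
	nal := make([]byte, 70000) // hypothetical >64 kB NAL unit
	nal[0] = 0x65              // e.g. an IDR slice NAL header
	frags := fragmentFUA(nal, 1400)
	fmt.Println(len(frags), "FU-A fragments")
}
```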

Handling lots of req / sec in go or nodejs

I'm developing a web app that needs to handle bursts of very high load: once per minute I get a burst of requests within a very few seconds (~1M-3M/sec), and then for the rest of the minute I get nothing.
What's my best strategy to handle as many requests per second as possible at each front server? Just send a reply and store the request in memory somehow, to be processed in the background by a DB writer worker later?
The aim is to do as little as possible during the burst, and write the requests to the DB as soon as possible after the burst.
Edit: the order of transactions is not important,
we can lose some transactions, but 99% need to be recorded,
and the latency of getting all requests to the DB can be a few seconds after the last request has been received. Let's say not more than 15 seconds.
This question is kind of vague. But I'll take a stab at it.
1) You need limits. A simple implementation will open millions of connections to the DB, which will obviously perform badly. At the very least, each connection eats MBs of RAM on the DB. Even with connection pooling, each 'thread' could take a lot of RAM to record its (incoming) state.
If your app server has a limited number of processing threads, you can use HAProxy to "pick up the phone" and buffer the request in a queue for a few seconds until there is a free thread on your app server to handle it.
In fact, you could just use a web server like nginx to take the request and say "200 OK". Then later, a simple app reads the web log and inserts into DB. This will scale pretty well, although you probably want one thread reading the log and several threads inserting.
2) If your language has coroutines, it may be better to handle the buffering yourself. You should measure the overhead of relying on your language runtime for scheduling.
For example, if each HTTP request is 1K of headers + data, you want to parse it and throw away everything but the one or two pieces of data that you actually need (i.e. the DB ID). If you rely on your language's coroutines as an 'implicit' queue, each coroutine will hold a 1K buffer while its request is being parsed. In some cases, it's more efficient/faster to have a finite number of workers and manage the queue explicitly. When you have a million things to do, small overheads add up quickly, and the language runtime won't always be optimized for your app.
Also, Go will give you far better control over your memory than Node.js. (Structs are much smaller than objects. The 'overhead' for the Keys to your struct is a compile-time thing for Go, but a run-time thing for Node.js)
3) How do you know it's working? You want to be able to know exactly how you are doing. When you rely on the language co-routines, it's not easy to ask "how many threads of execution do I have and what's the oldest one?" If you make an explicit queue, those questions are much easier to ask. (Imagine a handful of workers putting stuff in the queue, and a handful of workers pulling stuff out. There is a little uncertainty around the edges, but the queue in the middle very explicitly captures your backlog. You can easily calculate things like "drain rate" and "max memory usage" which are very important to knowing how overloaded you are.)
My advice: Go with Go. Long term, Go will be a much better choice. The Go runtime is a bit immature right now, but every release is getting better. Node.js is probably slightly ahead in a few areas (maturity, size of community, libraries, etc.)
How about a channel with a buffer size equal to what the DB writer can handle in 15 seconds? When the request comes in, it is sent on the channel. If the channel is full, give some sort of "System Overloaded" error response.
Then the DB writer reads from the channel and writes to the database.
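A minimal sketch of that channel approach in Go follows; the buffer size, batch size, and writeBatch helper are placeholders, and a real writer would also flush partial batches on a timer.

```go
// Sketch of the buffered-channel approach: the handler enqueues and
// returns immediately; a single writer drains the channel into the DB.
// Sizes, the request type, and writeBatch are placeholders.
package main

import (
	"log"
	"net/http"
)

type request struct {
	ID string // the one or two fields you actually need to persist
}

var queue = make(chan request, 3_000_000) // roughly 15 s of writer throughput

func handler(w http.ResponseWriter, r *http.Request) {
	select {
	case queue <- request{ID: r.URL.Query().Get("id")}:
		w.WriteHeader(http.StatusOK)
	default:
		// Channel full: shed load instead of falling over.
		http.Error(w, "System Overloaded", http.StatusServiceUnavailable)
	}
}

func dbWriter() {
	batch := make([]request, 0, 1000)
	for req := range queue {
		batch = append(batch, req)
		if len(batch) == cap(batch) {
			writeBatch(batch) // hypothetical bulk insert
			batch = batch[:0]
		}
	}
	// A production writer would also flush partial batches on a timer.
}

func writeBatch(batch []request) {
	// Placeholder for the actual bulk DB insert.
}

func main() {
	go dbWriter()
	http.HandleFunc("/collect", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```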

Reliably monitor a serial port (Nortel CS1000)

I have a custom python script that monitors the call logs from a Nortel phone system. This phone system is under extremely high volume throughout the day and it's starting to appear that some records may be getting lost.
Some of you may dislike this, but I'm not interested in sharing the source code or current method in any way. I would rather consider this from a "new project" approach.
I'm looking for insight into the easiest and safest way to reliably monitor heavy data output through a serial port on Linux. I'm not limiting this to any particular set of tools or languages, I want to find out what works best to do this one critical job. I'm comfortable enough parsing the data and inserting it into mysql that we could just assume the data could be dropped to a text file.
Thank you
Well, the way that I would approach this is to have 2 threads (or processes) working.
Thread 1: The read thread
This thread does nothing but read data from the raw serial port and put the data into a local buffer/queue (In memory is preferred for speed). It should do nothing else. Depending on the clock speed of the serial connection, this should be pretty easy to do.
Thread 2: The processing thread
This thread just sleeps until there is data in the local buffer to process, then reads and processes it. That's it.
The reason for splitting it into two is so that if one is busy (say, the processing thread blocking on MySQL) it won't affect the other. After all, while the serial port is buffered by the OS, that buffer size is limited.
But then again, any local program is likely going to be way faster than the serial port can send data. Serial transfer is actually quite slow relative to the clock speed of the processor (115.2kbps is about the limit on standard hardware). So unless you're CPU speed bound (such as on an Arduino), I can't see normal conditions affecting it too much. So your choice of language really shouldn't be of too much concern (assuming modern hardware). Stick to what you know.
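In Go, for example, the two "threads" map naturally onto two goroutines joined by a buffered channel. A minimal sketch, assuming the port has already been configured (e.g. with stty) so it can simply be read like a file, with the device path as a placeholder:

```go
// Sketch of the two-thread design: one goroutine only reads the port
// and pushes lines into a large in-memory buffer; the other drains the
// buffer and does the slow work (parsing, MySQL). Device path is a
// placeholder and the port is assumed to be configured already.
package main

import (
	"bufio"
	"log"
	"os"
)

func main() {
	port, err := os.Open("/dev/ttyS0") // placeholder device
	if err != nil {
		log.Fatal(err)
	}
	defer port.Close()

	lines := make(chan string, 100000) // buffer between the two sides

	// Reader: do nothing but pull data off the port.
	go func() {
		scanner := bufio.NewScanner(port)
		for scanner.Scan() {
			lines <- scanner.Text()
		}
		if err := scanner.Err(); err != nil {
			log.Println("read error:", err)
		}
		close(lines)
	}()

	// Processor: parse and store; a stall here only fills the channel
	// and does not block the reader until the buffer is exhausted.
	for line := range lines {
		process(line)
	}
}

// process is a placeholder for parsing a record and inserting it into MySQL.
func process(line string) {}
```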

How can I model this usage scenario?

I want to create a fairly simple mathematical model that describes usage patterns and performance trade-offs in a system.
The system behaves as follows:
clients periodically issue multi-cast packets to a network of hosts
any host that receives the packet, responds with a unicast answer directly
the initiating host caches the responses for some given time period, then discards them
if the cache still holds a valid response the next time a request is required, data is pulled from the cache, not the network
packets are of a fixed size and always contain the same information
hosts are symmetric - any host can issue a request and respond to requests
I want to produce some simple mathematical models (and graphs) that describe the trade-offs available given some changes to the above system:
What happens when you vary the amount of time a host caches responses? How much data does this save? How many calls to the network do you avoid? (clearly depends on activity)
Suppose responses are also multi-cast, and any host that overhears another client's request can cache all the responses it hears, thereby potentially saving itself a network request - how would this affect the overall state of the system?
Now, this one gets a bit more complicated - each request-response cycle alters the state of one other host in the network, so the more activity there is, the quicker caches become invalid. How do I model the trade-off between the number of hosts, the rate of activity, the "dirtiness" of the caches (assuming hosts listen in on others' responses), and how this changes with the cache validity period? Not sure where to begin.
I don't really know what sort of mathematical model I need, or how I construct it. Clearly it's easier to just vary two parameters, but particularly with the last one, I've got maybe four variables changing that I want to explore.
Help and advice appreciated.
Investigate tokenised Petri nets. These seem to be an appropriate tool as they:
provide a graphical representation of the models
provide substantial mathematical analysis
have a large body of prior work and underlying analysis
are (relatively) simple mathematical models
seem to be directly tied to your problem in that they deal with constraint dependent networks that pass tokens only under specified conditions
I found a number of references (quality not assessed) by a search on "token Petri net"
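As a starting point for the first trade-off (varying the cache validity period), a quick Monte Carlo simulation can complement whichever formalism you pick. The sketch below, in Go, assumes Poisson request arrivals and a simple TTL cache; both are simplifying assumptions rather than part of the system as described.

```go
// Quick Monte Carlo sketch for the first trade-off: how the fraction
// of requests answered from the cache changes with the cache TTL.
// Poisson arrivals and a plain TTL cache are simplifying assumptions.
package main

import (
	"fmt"
	"math/rand"
)

// hitRatio simulates n requests arriving at `rate` per second against a
// cache whose entries stay valid for ttl seconds.
func hitRatio(rate, ttl float64, n int) float64 {
	t, validUntil := 0.0, -1.0
	hits := 0
	for i := 0; i < n; i++ {
		t += rand.ExpFloat64() / rate // exponential inter-arrival times
		if t <= validUntil {
			hits++ // served from cache, no multicast request needed
		} else {
			validUntil = t + ttl // miss: go to the network, refill cache
		}
	}
	return float64(hits) / float64(n)
}

func main() {
	for _, ttl := range []float64{0.5, 1, 5, 10, 30} {
		fmt.Printf("ttl=%4.1fs  hit ratio=%.2f\n", ttl, hitRatio(2.0, ttl, 1_000_000))
	}
}
```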