Is it possible to remove start codes using NVENC? - h.264

I'm using NVENC SDK to encode OpenGL frames and stream them over RTSP. NVENC gives me encoded data in the form of several NAL units. In order to stream them with Live555 I need to find the start code (0x00 0x00 0x01) and remove it. I want to avoid this operation.
NVENC has a sliceOffset attribute which I can consult, but it indicates slices, not NAL units. It only points to the end of the SPS and PPS headers, where the actual data starts. I understand that a slice is not the same thing as a NAL unit (correct me if I'm wrong). I'm already forcing single slices for encoded data.
Is any of the following possible?
1) Force NVENC to encode individual NAL units
2) Force NVENC to indicate where the NAL units in each encoded data block are
3) Make Live555 accept the sequence parameters for streaming

There seems to be a point where everyone trying to do H.264 over RTSP/RTP comes down to this question. Here are my two cents:
1) There is a concept of an access unit. An access unit is a set of NAL units (possibly only one) that represents an encoded frame. That is the level of logic you should work at. If you want the encoder to give you individual NAL units, what behavior do you expect when encoding one raw frame results in multiple NAL units (e.g. SPS + PPS + coded picture)? That said, there are ways to configure the encoder to reduce the number of NAL units in an access unit (not including the AUD NAL, not repeating SPS/PPS, excluding SEI NALs). With that knowledge you know what to expect and can more or less force the encoder to give you a single NAL per frame (this will not work for all frames, but with the knowledge you have about the decoder you can handle the exceptions). I'm not an expert on the NVENC API, I've only just started using it, but with Intel Quick Sync, turning off AUD and SEI and disabling repetition of SPS/PPS gave me roughly 1 NAL per frame for frames 2...N.
2) I can't answer this since, as I mentioned, I'm not familiar with the API, but I highly doubt it.
3) SPS and PPS should be in the first access unit (the first bitstream you get from the encoder). You could find the right NALs in the bitstream and extract them, or there may be a special API call to obtain them from the encoder.
All that being said, I don't think it is that hard to run through the bitstream, parse the start codes, extract the NAL units and feed them to Live555 one by one. Of course, if the encoder can output the bitstream in AVCC format, you should use it: unlike Annex B with its start codes, AVCC puts a length value before each NAL unit, so you can jump straight to the next one without searching for a prefix. When it is just RTP, it is easy enough to implement the transport yourself (I had bad luck with GStreamer, which did not have proper support for FU-A packetization). With RTSP the overhead of the transport infrastructure is bigger, and it is reasonable to use a third-party library like Live555.
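As a rough illustration of how little work the start-code scan is, here is a minimal Python sketch (not NVENC or Live555 code). The helper names `split_annexb` and `nal_type` are made up for this example, and it assumes the whole access unit is already in one buffer.

```python
import re

# Annex B start code: 0x000001, optionally preceded by extra zero bytes.
START_CODE = re.compile(b"\x00\x00\x01")

def split_annexb(buf: bytes) -> list[bytes]:
    """Return the NAL units found in buf, with their start codes removed."""
    starts = [m.end() for m in START_CODE.finditer(buf)]
    nals = []
    for i, begin in enumerate(starts):
        # A NAL unit ends where the next start code begins (or at the end of buf).
        end = starts[i + 1] - 3 if i + 1 < len(starts) else len(buf)
        # Drop zero bytes that belong to a following 4-byte start code;
        # the last byte of a NAL unit itself is never 0x00.
        nal = buf[begin:end].rstrip(b"\x00")
        if nal:
            nals.append(nal)
    return nals

def nal_type(nal: bytes) -> int:
    """nal_unit_type is the low 5 bits of the first byte (7 = SPS, 8 = PPS)."""
    return nal[0] & 0x1F
```

Filtering the result with `nal_type(n) in (7, 8)` would pick out the SPS and PPS from the first access unit, as suggested in point 3.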

Related

Why use Base64 in IoT use cases?

I've noticed a lot of payload data encoded as Base64 before transmission in many IoT use cases, particularly in LPWAN (LoRa, LTE-M, NB-IoT, Sigfox, etc.).
For simplicity's sake, sending JSON payloads makes a lot of sense. It's also my understanding that Base64 encoding adds some weight to the payload size, which seems counterintuitive for low-bandwidth use cases.
Could someone explain the benefits of using Base64 in IoT (or otherwise) applications?
Thanks!
Well, base64 is generally used to encode binary formats. Since binary is the native representation of data in a computer, it is obviously the easiest format for resource-constrained embedded devices to handle. It's also reasonably compact.
As an algorithm, base64 is fairly simple conceptually and requires very few resources to implement, so it's a good compromise for squeezing binary data through a text-only channel. Building a JSON record, on the other hand, typically requires a JSON library which consumes RAM and code space: not horribly much, but still more than base64.
Not to mention that the data channels you've mentioned are rather starved for bandwidth. E.g. public LoRaWAN deployments are notorious for permitting a device to send a few dozen bytes of data a few dozen times per day.
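Suppose I want to encode a data record consisting of, say, a 32-bit timestamp, an 8-bit code specifying the type of data (i.e. temperature, voltage or pressure) and a 32-bit data sample:
#include <stdint.h>
struct __attribute__((packed)) record {  /* GCC/Clang syntax: no padding between fields */
    uint32_t time;   /* 32-bit timestamp */
    uint8_t type;    /* temperature, voltage or pressure */
    uint32_t value;  /* 32-bit data sample */
};
Packed like this, the record uses 9 bytes. It grows to 12 bytes after being encoded with base64.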
Compare that with a simple JSON record, which is 63 bytes after leaving out all whitespace:
{
"time": "2012-04-23T18:25:43.511Z",
"type": "temp",
"value": 26.94
}
So 12 B or 63 B: not much of a contest on bandwidth-starved data channels. On a LoRaWAN link that could make the difference between squeezing about 5 data records or just 1 data record into your precious uplink slot.
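The size comparison is easy to reproduce. The following Python sketch is only an illustration: the names `RECORD`, `encode_record` and `decode_record` are invented here, and storing the sample as an integer in hundredths of a degree is an assumption, not something the record format above prescribes.

```python
import base64
import json
import struct

# "<" = little-endian, no padding: uint32 time, uint8 type, uint32 value.
RECORD = struct.Struct("<IBI")

def encode_record(timestamp: int, kind: int, value: int) -> bytes:
    """Pack one record into 9 bytes, then base64-encode it for a text-only channel."""
    return base64.b64encode(RECORD.pack(timestamp, kind, value))

def decode_record(text: bytes) -> tuple:
    """Inverse of encode_record: returns (timestamp, kind, value)."""
    return RECORD.unpack(base64.b64decode(text))

# 1335205543 is 2012-04-23T18:25:43Z as Unix time; 2694 stands for 26.94 (assumed scaling).
binary = RECORD.pack(1335205543, 1, 2694)
text = encode_record(1335205543, 1, 2694)
as_json = json.dumps(
    {"time": "2012-04-23T18:25:43.511Z", "type": "temp", "value": 26.94},
    separators=(",", ":"),  # no whitespace at all
)
print(len(binary), len(text), len(as_json))
```

The three printed numbers are the packed size, the base64 size and the whitespace-free JSON size for the same reading.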
Regarding data compression - on a resource constrained embedded device it's much, much more practical to encode data as compact binary instead of transforming it into a verbose format and compressing that.

comparing h.264 encoding decoding performance

I am a beginner with video codecs, not an expert.
I just want to know, based on the same criteria, which is more efficient in H.264: encoding or decoding?
Thanks
Decoding is more efficient. To be useful, decoding must run in real time, whereas encoding need not (except in videophone / conferencing applications).
How much more efficient? An encoder can generate motion vectors. The more compute power used on generating those motion vectors, the more accurate they are. And, the more accurate they are, the more bandwidth is available for the difference frames, so the quality goes up.
So, the kind of encoding used to generate video for streaming or distribution on DVD or BD discs can run many times slower than real time on server farms. But decoding for that kind of program is useless unless it runs in real time.
Even in the case of real-time encoding it takes more power (actual milliwatts, compute cycles, etc) than decoding.
It's true of H.264, H.265, VP8, VP9, and other video codecs.

Re encode an audio stream recording on the fly?

Is it possible to rip an audio stream with Variable Bit Rate encoding and re-encode it on the fly, as it is being recorded, with Constant Bit Rate encoding?
I am downloading an audio stream in AAC format with VBR encoding using cURL.
Players calculate the duration of a VBR-encoded file from its byte length, so different players report different durations. Because of this discrepancy I cannot seek and slice precisely. I would need to re-encode it somehow with a constant bit rate to get seeking to work properly.
The audio stream is hours long, so re-encoding it afterwards takes far too much time and processing power.
Is there anything I can do about this?
Perhaps I can specify some settings in cURL to achieve a constant recording bit rate?

RTP fragmentation vs UDP fragmentation

I don't understand why we bother fragmenting at RTP level if UDP (or IP) layer does the fragmentation.
As I understand it, let's say we are on Ethernet link, the MTU is 1500 bytes.
If I have to send, for example, 3880 bytes, fragmenting at the IP layer would result in 3 packets of 1500, 1500, and 940 bytes respectively (the IP header is 20 bytes, so the total overhead is 60 bytes).
If I do it at UDP layer the overhead will be 84 bytes (3x 28 bytes).
At RTP layer it's 120 bytes of overhead.
At H264/NAL packetization layer, it's 3 more bytes (so 123 bytes final) for FU-A mode.
For such a small packet, that is a final increase of 3.1% over the initial packet size, while at the IP layer it would only waste 1.5% overall.
Is there any valid reason to bother defining such complex packetization rules at the RTP layer, knowing they will always be worse than lower-layer fragmentation?
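For reference, the per-layer accounting can be written out as a small Python sketch. It assumes IPv4 without options (20 bytes), an 8-byte UDP header, a 12-byte RTP header without CSRCs or extensions, and two FU-A bytes per fragment with the original 1-byte NAL header not repeated. The function names are made up for this example. Because it also counts the UDP header and the 8-byte alignment of IP fragments, its totals differ slightly from the rough figures quoted in the question.

```python
import math

IP_HDR, UDP_HDR, RTP_HDR, FU_HDR = 20, 8, 12, 2  # bytes; IPv4 without options
MTU = 1500

def ip_fragment_overhead(app_bytes: int, rtp: bool = False) -> int:
    """Header bytes when one UDP datagram (optionally with RTP) is split by IPv4."""
    datagram = UDP_HDR + (RTP_HDR if rtp else 0) + app_bytes
    per_fragment = (MTU - IP_HDR) // 8 * 8  # fragment payloads are multiples of 8
    fragments = math.ceil(datagram / per_fragment)
    return fragments * IP_HDR + (datagram - app_bytes)

def fu_a_overhead(nal_bytes: int) -> int:
    """Header bytes when one NAL unit is split into FU-A packets that fit the MTU."""
    payload = nal_bytes - 1  # the 1-byte NAL header is folded into the FU bytes
    per_packet = MTU - IP_HDR - UDP_HDR - RTP_HDR - FU_HDR
    packets = math.ceil(payload / per_packet)
    return packets * (IP_HDR + UDP_HDR + RTP_HDR + FU_HDR) - 1

for label, extra in [("UDP + IP fragmentation", ip_fragment_overhead(3880)),
                     ("RTP + IP fragmentation", ip_fragment_overhead(3880, rtp=True)),
                     ("RTP with FU-A", fu_a_overhead(3880))]:
    print(f"{label}: {extra} bytes ({100 * extra / 3880:.1f}%)")
```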
Except for the first fragment, fragmented IP traffic does not contain the source or destination port numbers. Instead, the receiver glues the fragments back together using the IP identification field. This makes it impossible for stateless intermediate network devices (switches and routers) to re-apply QoS markings to those fragments when they need to (because .1p or DSCP flags were cleared by another device or never existed in the first place). Unless the device has the resources to manage per-session state, it either has to risk rate-limiting/prioritizing fragments from unrelated streams, or not prioritize any fragments, some of which can be voice/video.
AFAIK RTP packets never IP-fragment unless the network has MTU mismatches in it. Hence every packet carries a UDP header with source and destination port numbers, so if you can get your clients to use known port ranges, you can re-establish QoS markings based on this information, pass IP fragments as vanilla traffic, and not worry about dropping voice/video data.
RTP is designed with UDP in mind.
Applications typically run RTP on top of UDP to make use of its
multiplexing and checksum services; both protocols contribute parts of
the transport protocol functionality.
However, the services RTP adds on top of raw UDP, such as the ability to detect packet reordering and loss and to recover timing, require that the UDP data carry service information alongside the RTP payload.
The Internet, like other packet networks, occasionally loses and
reorders packets and delays them by variable amounts of time. To cope
with these impairments, the RTP header contains timing information
and a sequence number that allow the receivers to reconstruct the
timing produced by the source, so that in this example, chunks of
audio are contiguously played out the speaker every 20 ms. This
timing reconstruction is performed separately for each source of RTP
packets in the conference. The sequence number can also be used by
the receiver to estimate how many packets are being lost.
RTP is also designed to be extensible, with common headers and a data-specific payload:
RTP is a protocol framework that is deliberately not complete. This document specifies those functions expected to be common across all the applications for which RTP would be appropriate. Unlike conventional protocols in which additional functions might be accommodated by making the protocol more general or by adding an option mechanism that would require
parsing, RTP is intended to be tailored through modifications and/or additions to the headers as needed.
All quotes are from RFC 1889 "RTP: A Transport Protocol for Real-Time Applications".
That is, the RTP overhead for an H.264 stream is not just a waste of bandwidth. RTP headers and H.264 payload formatting make it possible, at moderate cost, to stream video data more reliably while leveraging a specification that is well defined and suits many kinds of data.
I'd like to add that a lot of RTP servers/senders go about sending split datagrams inefficiently.
They use a lot of malloc/free in dynamic buffer contexts.
They also use one syscall per part of the message instead of message-vectors.
To add insult to injury they usually do a lot of time calculation / other handling between sending every part of the datagram.
This causes even more syscalls, sometimes even stretching the packet over a long time because they have no upper bound when the packet should be finished, only that it is finished before sending the next batch of packets.
Inefficient behavior like this seriously gets in the way if you want to scale throughput or run on a low-power embedded CPU. For bandwidth, network and CPU efficiency reasons, it's usually far better to hand the entire datagram to the kernel in one go and let it deal with fragmentation instead of having userspace try to figure it out.
Well, after a lot of thinking about this, there is no reason not to use IP-based fragmentation up to 64kB (and this will happen if you have many NAL units with the same timestamp that you need to aggregate, via STAP-A for example).
RFC 6184 is clear: you can carry up to 64kB per NAL unit this way, since each NAL unit is preceded by its size as a 2-byte (16-bit) field, although staying below the MTU is preferred.
What happens if the cumulative size of the NAL units sharing one timestamp is larger than 64kB? RFC 6184 does not say, but I guess you'll have to send all your NAL units as separate FU-A packets without increasing the timestamp between them (this is the only reason the Start/End bits in the FU header are useful, since there is no longer a 1:1 match between the End bit and the RTP marker bit).
The RFC states:
An aggregation packet can
carry as many aggregation units as necessary; however, the total
amount of data in an aggregation packet obviously MUST fit into an IP
packet, and the size SHOULD be chosen so that the resulting IP packet
is smaller than the MTU size
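To make the STAP-A layout concrete, here is a minimal Python sketch of building and parsing such a payload as I read RFC 6184: one header byte with type 24, then a 2-byte big-endian size before each NAL unit. The function names are invented for this example, and the RTP header itself is left out.

```python
import struct

STAP_A = 24  # NAL unit type value used for STAP-A payloads in RFC 6184

def build_stap_a(nals: list[bytes]) -> bytes:
    """Aggregate NAL units that share one timestamp into a STAP-A payload."""
    # F is set if any aggregated NAL has it set; NRI is the maximum NRI present.
    f_bit = max(n[0] & 0x80 for n in nals)
    nri = max(n[0] & 0x60 for n in nals)
    out = bytearray([f_bit | nri | STAP_A])
    for nal in nals:
        if len(nal) > 0xFFFF:
            raise ValueError("NAL unit does not fit the 16-bit size field")
        out += struct.pack("!H", len(nal)) + nal
    return bytes(out)

def parse_stap_a(payload: bytes) -> list[bytes]:
    """Split a STAP-A payload back into its NAL units."""
    if payload[0] & 0x1F != STAP_A:
        raise ValueError("not a STAP-A payload")
    nals, pos = [], 1
    while pos + 2 <= len(payload):
        (size,) = struct.unpack_from("!H", payload, pos)
        pos += 2
        nals.append(payload[pos:pos + size])
        pos += size
    return nals
```

A sender would still have to check that the resulting payload plus headers fits into one IP packet, and preferably below the MTU, as the quoted passage requires.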
When a "single NAL per frame" is larger than the MTU (for example, 1460 bytes with Ethernet), it has to be split with a fragmentation unit packetization (for example, FU-A).
However, nothing in the RFC states that the limit should be 1460 bytes. And it makes sense to have larger than that when doing Ethernet only streaming (as computed above)
If you have a NAL unit larger than 64kB, you must use FU-A to send it, since it cannot fit in a single IP datagram.
The RFC states:
This payload type allows fragmenting a NAL unit into several RTP
packets. Doing so on the application layer instead of relying on
lower-layer fragmentation (e.g., by IP) has the following advantages:
o The payload format is capable of transporting NAL units bigger
than 64 kbytes over an IPv4 network that may be present in pre-
recorded video, particularly in High-Definition formats (there is
a limit of the number of slices per picture, which results in a
limit of NAL units per picture, which may result in big NAL
units).
o The fragmentation mechanism allows fragmenting a single NAL unit
and applying generic forward error correction as described in
Section 12.5.
Which I understand as: "If your NAL unit is smaller than 64 kbytes and you don't care about FEC, then don't use FU-A; send it in a single RTP packet instead."
Another case where FU-A is necessary is when receiving an H.264 stream with RTP over RTSP (interleaved mode). The packet size must fit in 2 bytes (16 bits), so you must also fragment larger NAL units even when sending over a reliable stream socket.
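For completeness, FU-A fragmentation itself is short. This Python sketch follows my reading of RFC 6184: FU indicator = F and NRI of the original header plus type 28, FU header = Start/End bits plus the original type. The names `fragment_fu_a`, `reassemble_fu_a` and `max_payload` are made up here; `max_payload` is whatever payload size limit you choose (MTU-based or larger).

```python
FU_A = 28  # NAL unit type value used for FU-A payloads in RFC 6184

def fragment_fu_a(nal: bytes, max_payload: int) -> list[bytes]:
    """Split one NAL unit into RTP payloads of at most max_payload bytes each."""
    if max_payload < 3:
        raise ValueError("max_payload must leave room for the two FU bytes")
    if len(nal) <= max_payload:
        return [nal]  # fits: send as a single NAL unit packet
    header, body = nal[0], nal[1:]
    indicator = (header & 0xE0) | FU_A  # keep F and NRI, replace the type
    chunk = max_payload - 2             # room left after indicator + FU header
    pieces = [body[i:i + chunk] for i in range(0, len(body), chunk)]
    out = []
    for i, piece in enumerate(pieces):
        start = 0x80 if i == 0 else 0
        end = 0x40 if i == len(pieces) - 1 else 0
        out.append(bytes([indicator, start | end | (header & 0x1F)]) + piece)
    return out

def reassemble_fu_a(payloads: list[bytes]) -> bytes:
    """Rebuild the NAL unit from in-order FU-A payloads (no loss handling)."""
    first = payloads[0]
    header = (first[0] & 0xE0) | (first[1] & 0x1F)
    return bytes([header]) + b"".join(p[2:] for p in payloads)
```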
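The 2-byte limit comes from the interleaved framing itself: a '$' byte, a 1-byte channel id and a 16-bit length precede each packet on the RTSP TCP connection (RFC 2326, section 10.12). A minimal Python sketch, with function names invented for this example:

```python
import struct

def frame_interleaved(channel: int, rtp_packet: bytes) -> bytes:
    """Wrap one RTP packet for RTSP interleaved (RTP over TCP) transport."""
    if len(rtp_packet) > 0xFFFF:
        raise ValueError("packet exceeds the 16-bit length field; fragment it first")
    # '$' magic byte, channel id, big-endian 16-bit length, then the packet.
    return struct.pack("!cBH", b"$", channel, len(rtp_packet)) + rtp_packet

def read_interleaved(buf: bytes):
    """Return (channel, packet, remaining bytes), or None if buf is incomplete."""
    if len(buf) < 4:
        return None
    magic, channel, length = struct.unpack_from("!cBH", buf)
    if magic != b"$":
        raise ValueError("not an interleaved frame")
    if len(buf) < 4 + length:
        return None
    return channel, buf[4:4 + length], buf[4 + length:]
```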

How to determine last NAL of an Access Unit in H.264

When parsing NAL units from a H.264 source is it possible to determine the end of an Access Unit without having to find the start of the next one? I am aware of the following section in the H.264 spec:
7.4.1.2.4 Detection of the first VCL NAL unit of a primary coded picture
And I have currently implemented this. The problem here though, is that if there is a large time gap at the end of an Access Unit I won't 'get' the Access Unit until the start of the next one. Is there another way to determine the end (ie. last NAL) of an Access Unit?
I am also aware of the marker bit in the RTP header, but it is not reliable enough for us to use, and in some cases it is just plain wrong.
No, I don't think so.
The (unreliable) marker bit is the only way to signal the end of an access unit (in the case of RTP).
They should have handled it more reliably in the H.264 payload format (RFC 6184).
You can check timestamps and sequence numbers to infer the start of a new AU, but that is also unreliable (packet loss, reordering, the need to wait for the first packet of the next AU).
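A minimal Python sketch of that combined heuristic: close the access unit early when the marker bit is set, and fall back to a timestamp change otherwise. `AccessUnitTracker` is a made-up name, the fixed 12-byte RTP header layout is the only thing taken from the RTP spec, and the sketch deliberately ignores packet loss and reordering, which is exactly where the approach becomes unreliable.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Decode the fixed 12-byte RTP header (CSRC list and extensions ignored)."""
    b0, b1, seq, ts, ssrc = struct.unpack_from("!BBHII", packet)
    return {"version": b0 >> 6, "marker": bool(b1 & 0x80),
            "payload_type": b1 & 0x7F, "seq": seq, "timestamp": ts, "ssrc": ssrc}

class AccessUnitTracker:
    """Groups in-order RTP packets into access units."""

    def __init__(self):
        self.current_ts = None
        self.pending = []

    def push(self, packet: bytes) -> list:
        """Return the access units (lists of packets) completed by this packet."""
        hdr = parse_rtp_header(packet)
        finished = []
        if self.pending and hdr["timestamp"] != self.current_ts:
            # Late detection: the previous AU never carried a marker bit.
            finished.append(self.pending)
            self.pending = []
        self.current_ts = hdr["timestamp"]
        self.pending.append(packet)
        if hdr["marker"]:
            # Early detection: the sender flagged the last packet of the AU.
            finished.append(self.pending)
            self.pending = []
        return finished
```

A timeout on `pending` could cover the "large time gap" case, at the cost of occasionally splitting a slow access unit.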