Low latency (< 2s) live video streaming HTML5 solutions?

With Chrome disabling Flash by default very soon, I need to start looking into HTML5 replacements for my Flash/RTMP setup.
Currently with Flash + RTMP I have a live video stream with < 1-2 second delay.
I've experimented with MPEG-DASH, which seems to be the new industry standard for streaming, but it came up short: a 5-second delay was the best I could squeeze out of it.
For context, I am trying to allow users to control physical objects they can see on the stream, so anything above a couple of seconds of delay leads to a frustrating experience.
Are there any other techniques, or are there really no low-latency HTML5 solutions for live streaming yet?

Technologies and Requirements
The only web-based technology set really geared toward low latency is WebRTC. It's built for video conferencing. Codecs are tuned for low latency over quality. Bitrates are usually variable, opting for a stable connection over quality.
However, you don't necessarily need this low latency optimization for all of your users. In fact, from what I can gather on your requirements, low latency for everyone will hurt the user experience. While your users in control of the robot definitely need low latency video so they can reasonably control it, the users not in control don't have this requirement and can instead opt for reliable higher quality video.
How to Set it Up
In-Control Users to Robot Connection
Users controlling the robot will load a page that utilizes some WebRTC components for connecting to the camera and control server. To facilitate WebRTC connections, you need some sort of STUN server. To get around NAT and other firewall restrictions, you may need a TURN server. Both of these are usually built into Node.js-based WebRTC frameworks.
The cam/control server will also need to connect via WebRTC. Honestly, the easiest way to do this is to make your controlling application somewhat web based. Since you're using Node.js already, check out NW.js or Electron. Both can take advantage of the WebRTC capabilities already built in WebKit, while still giving you the flexibility to do whatever you'd like with Node.js.
The in-control users and the cam/control server will make a peer-to-peer connection via WebRTC (or TURN server if required). From there, you'll want to open up a media channel as well as a data channel. The data side can be used to send your robot commands. The media channel will of course be used for the low latency video stream being sent back to the in-control users.
Again, it's important to note that the video that will be sent back will be optimized for latency, not quality. This sort of connection also ensures a fast response to your commands.
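As a rough illustration of the in-control user's side, here is a minimal sketch. The STUN/TURN hostnames, credentials, and the `signaling` helper are placeholders, not part of any particular framework:

```javascript
// Sketch only: server names, credentials, and the signaling object are assumed.
async function connectToRobot(signaling) {
  const pc = new RTCPeerConnection({
    iceServers: [
      { urls: 'stun:stun.example.com:3478' },
      { urls: 'turn:turn.example.com:3478', username: 'user', credential: 'pass' }
    ]
  });

  // Data channel for sending robot commands with low overhead.
  const control = pc.createDataChannel('robot-control');
  control.onopen = () => control.send(JSON.stringify({ cmd: 'forward', speed: 0.5 }));

  // Ask to receive video even though this side sends none.
  pc.addTransceiver('video', { direction: 'recvonly' });

  // Low-latency video coming back from the cam/control server.
  pc.ontrack = (event) => {
    document.querySelector('video').srcObject = event.streams[0];
  };

  // Trickle ICE candidates and the offer over your own signaling channel.
  pc.onicecandidate = (e) => e.candidate && signaling.send({ candidate: e.candidate });
  await pc.setLocalDescription(await pc.createOffer());
  signaling.send({ sdp: pc.localDescription });

  return { pc, control };
}
```

The cam/control server side mirrors this with the roles reversed: it answers the offer, attaches the camera's media tracks, and listens on the data channel for commands.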
Video for Viewing Users
Users that are simply viewing the stream and not controlling the robot can use normal video distribution methods. It is actually very important for you to use an existing CDN and transcoding services, since you will have 10k-15k people watching the stream. With that many users, you're probably going to want your video in a couple different codecs, and certainly a whole array of bitrates. Distribution with DASH or HLS is easiest to work with at the moment, and frees you of Flash requirements.
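On the viewer side, attaching that CDN-hosted stream to an HTML5 player is straightforward. A rough sketch with hls.js, falling back to Safari's native HLS support (the manifest URL is a placeholder):

```javascript
// Sketch: attach an HLS stream from your CDN to a plain <video> element.
const video = document.querySelector('video');
const manifestUrl = 'https://cdn.example.com/live/stream.m3u8'; // placeholder URL

if (window.Hls && Hls.isSupported()) {
  const hls = new Hls();
  hls.loadSource(manifestUrl);
  hls.attachMedia(video);   // hls.js feeds the stream through MSE
} else if (video.canPlayType('application/vnd.apple.mpegurl')) {
  video.src = manifestUrl;  // Safari/iOS play HLS natively
}
video.play();               // may require a user gesture due to autoplay policies
```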
You will probably also want to send your stream to social media services. This is another reason why it's important to start with a high quality HD stream. Those services will transcode your video again, reducing quality. If you start with good quality first, you'll end up with better quality in the end.
Metadata (chat, control signals, etc.)
It isn't clear from your requirements what sort of metadata you need, but for small message-based data you can use a web socket library such as Socket.IO. As you scale this up to a few instances, you can use pub/sub, such as Redis, to distribute the messaging across the servers.
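For example, a rough sketch of fanning metadata out across several Node.js instances with Socket.IO and the socket.io-redis adapter (the Redis host/port and event name are assumptions):

```javascript
// Sketch: each app instance attaches the Redis adapter so that a message
// emitted on one instance reaches clients connected to any instance.
const io = require('socket.io')(3000);
const redisAdapter = require('socket.io-redis');

io.adapter(redisAdapter({ host: 'localhost', port: 6379 }));

// Broadcast a metadata message (e.g. a chat line or control-state update).
function broadcastMetadata(message) {
  io.emit('metadata', { ...message, sentAt: Date.now() });
}
```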
How you synchronize the metadata with the video depends a bit on what's in that metadata and what the synchronization requirement is, specifically. Generally speaking, you can assume that there will be a reasonable but unpredictable delay between the source video and the clients. After all, you cannot control how long they will buffer. Each device is different, each connection variable. What you can assume is that playback will begin with the first segment the client downloads. In other words, if a client starts buffering a video and begins playing it 2 seconds later, the video is 2 seconds behind from when the first request was made.
Detecting when playback actually begins client-side is possible. Since the server knows the timestamp of the video it sent to the client, it can inform the client of its offset relative to the beginning of video playback. Since you'll probably be using DASH or HLS and you need to use MSE (Media Source Extensions) with AJAX to get the data anyway, you can use the response headers in the segment response to indicate the timestamp for the beginning of the segment. The client can then synchronize itself. Let me break this down step by step (a code sketch follows the list):
1. Client starts receiving metadata messages from the application server.
2. Client requests the first video segment from the CDN.
3. CDN server replies with the video segment. In the response headers, the Date: header can indicate the exact date/time for the start of the segment.
4. Client reads the response Date: header (let's say 2016-06-01 20:31:00) and continues buffering the segments.
5. Client starts buffering/playback as normal.
6. Playback starts. Client can detect this state change on the player and knows that 00:00:00 on the video player is actually 2016-06-01 20:31:00.
7. Client displays metadata synchronized with the video, dropping any messages from previous times and buffering any for future times.
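A rough client-side sketch of that synchronization, assuming you fetch segments yourself for MSE and that the Date: (or a custom) header marks the start of the segment:

```javascript
// Sketch: map wall-clock metadata timestamps onto the player's timeline.
const video = document.querySelector('video');
let playerZeroWallClock = null; // wall-clock time (ms) of player time 0:00

async function fetchFirstSegment(url) {
  const res = await fetch(url);
  // Per the steps above, the Date: header marks the start of this segment.
  playerZeroWallClock = new Date(res.headers.get('Date')).getTime();
  return res.arrayBuffer(); // append to your MSE SourceBuffer as usual
}

// Convert a metadata message timestamp (ms since epoch) to player time (s).
function toPlayerTime(metadataTimestampMs) {
  return (metadataTimestampMs - playerZeroWallClock) / 1000;
}

video.addEventListener('playing', () => {
  console.log('Player 0:00 corresponds to', new Date(playerZeroWallClock));
});
```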
This should meet your needs and give you the flexibility to do whatever you need to with your video going forward.
Why not [magic-technology-here]?
When you choose low latency, you lose quality. Quality comes from available bandwidth. Bandwidth efficiency comes from being able to buffer and optimize entire sequences of images when encoding. If you wanted perfect quality (lossless for each image), you would need a ton of bandwidth (gigabits per viewer). That's why we have these lossy codecs to begin with.
Since you don't actually need low latency for most of your viewers, it's better to optimize for quality for them.
For the 2 users out of 15,000 that do need low latency, we can optimize for low latency for them. They will get substandard video quality, but will be able to actively control a robot, which is awesome!
Always remember that the internet is a hostile place where nothing works quite as well as it should. System resources and bandwidth are constantly variable. That's actually why WebRTC auto-adjusts (as best as reasonable) to changing conditions.
Not all connections can keep up with low latency requirements. That's why every single low latency connection will experience drop-outs. The internet is packet-switched, not circuit-switched. There is no real dedicated bandwidth available.
Having a large buffer (a couple seconds) allows clients to survive momentary losses of connections. It's why CD players with anti-skip buffers were created, and sold very well. It's a far better user experience for those 15,000 users if the video works correctly. They don't have to know that they are 5-10 seconds behind the main stream, but they will definitely know if the video drops out every other second.
There are tradeoffs in every approach. I think what I have outlined here separates the concerns and gives you the best tradeoffs in each area. Please feel free to ask for clarification or ask follow-up questions in the comments.

Related

Basic architecture to serve, stream and consume large audio files to minimize client-side resource consumption and latency

I am trying to build a web application which will need to have audio streaming functionality implemented in some way. Just to give you guys some context: it is designed to be a purely auditive experience/game/idkhowtocallit with lots of different sound assets varying in length and thus file size. The sound assets to be provided will consist of ambient sounds, spoken bits of conversation, but also long music sets (up to a couple of hours). The reason I think I won't be able to just host these audio files on some server or CDN and serve them from there is that the sound assets will need to be fetched and played dynamically (depending on user interaction) and as instantly as possible.
Most importantly, consuming larger files (like the music sets and long ambient loops) as a whole doesn't seem to be client-friendly at all to me (data consumption on mobile networks and client-side memory usage).
Also, without any buffering or streaming mechanism, the client won't be able to start playing these files before they are downloaded completely, right? Which would add the issue of high latencies.
I've tried to do some online research on how to properly implement a good infrastructure to stream bigger audio files to clients on the server side and found HLS and MPEG-DASH. I have some experience with consuming HLS streams with web players, and if I understand it correctly, I would use some sort of one-time transformation process (on or after file upload) to split up the files into chunks and create the playlist, and then just serve these files via HTTP. From what I understand, the process should be more or less the same for MPEG-DASH. My issue with these two techniques is that I couldn't really find any documentation on how to implement JavaScript/TypeScript clients (particularly using the Web Audio API) without reinventing the wheel. My best guess would be to use something like hls.js, bind the HLS streams to freshly created audio elements, and use these elements to create AudioSources in my Web Audio graph. How far off am I? I'm trying to get at least an idea of a best practice.
To sum up what I would really appreciate to get some clarity about:
Would HLS or MPEG-DASH really be the way to go or am I missing a more basic chunked file streaming mechanism with good libraries?
How - theoretically - would I go about limiting the amount of chunks downloaded in advance on the client side to save client-side resources, which is one of my biggest concerns?
I was looking into hosting services as well, but figured that most of them are specialized in hosting podcasts (fewer but very large files). Has anyone an opinion about whether I could use these services to host and stream possibly 1000s of files with sizes ranging from very small to rather large?
Thank you so much in advance to everyone who will be bothered with helping me out. Really appreciate it.
The reason I think I won't be able to just host these audio files on some server or CDN and serve them from there is that the sound assets will need to be fetched and played dynamically (depending on user interaction) and as instantly as possible.
Your long running ambient sounds can stream, using a normal HTMLAudioElement. When you play them, there may be a little lag time before they start since they have to begin streaming, but note that the browser will generally prefetch the metadata and maybe even the beginning of the media data.
For short sounds where latency is critical (like one-shot user interaction sound effects), load those into buffers with the Web Audio API for playback. You won't be able to stream them, but they'll play as instantly as you can get.
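A small sketch of that split, with a streamed ambient track and preloaded one-shot effects (the file paths are placeholders):

```javascript
// Long ambient loop: let the browser stream and buffer it for us.
const ambient = new Audio('/audio/ambient-forest.mp3'); // placeholder path
ambient.loop = true;
ambient.play(); // may need to happen in response to a user gesture

// Short, latency-critical effects: decode fully into memory up front.
const ctx = new (window.AudioContext || window.webkitAudioContext)();
const effects = new Map();

async function preloadEffect(name, url) {
  const res = await fetch(url);
  effects.set(name, await ctx.decodeAudioData(await res.arrayBuffer()));
}

function playEffect(name) {
  const source = ctx.createBufferSource();
  source.buffer = effects.get(name);
  source.connect(ctx.destination);
  source.start(); // plays essentially instantly
}
```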
Most importantly, consuming larger files (like the music sets and long ambient loops) as a whole doesn't seem to be client-friendly at all to me (data consumption on mobile networks and client-side memory usage).
If you want to play the audio, you naturally have to download that audio. You can't play something you haven't loaded in some way. If you use an audio element, you won't be downloading much more than what is being played. And, that downloading is mostly going to occur on-demand.
Also, without any buffering or streaming mechanism, the client won't be able to start playing these files before they are downloaded completely, right? Which would add the issue of high latencies.
If you use an audio element, the browser takes care of all the buffering and what not for you. You don't have to worry about it.
I've tried to do some online research on how to properly implement a good infrastructure to stream bigger audio files to clients on the server side and found HLS and MPEG-DASH.
If you're only streaming a single bitrate (which for audio is usually fine) and you're not streaming live content, then there's no point to HLS or DASH here.
Would HLS or MPEG-DASH really be the way to go or am I missing a more basic chunked file streaming mechanism with good libraries?
The browser will make ranged HTTP requests to get the data it needs out of the regular static media file. You don't need to do anything special to stream it. Just make sure your server is configured to handle ranged requests... most any should be able to do this right out of the box.
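To illustrate, a sketch of serving the files from Node.js with Express, whose static middleware answers Range requests out of the box (directory and file names are placeholders):

```javascript
// Sketch: express.static responds with Accept-Ranges / 206 Partial Content,
// which is all the browser's audio element needs in order to "stream".
const express = require('express');
const app = express();

app.use('/audio', express.static('media')); // placeholder directory

app.listen(8080);
// Quick check from a shell (expect a 206 response):
//   curl -I -H "Range: bytes=0-1" http://localhost:8080/audio/music-set-01.mp3
```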
How - theoretically - would I go about limiting the amount of chunks downloaded in advance on the client side to save client-side resources, which is one of my biggest concerns?
The browser does this for you if you use an audio element. Additionally, data saving settings and the detected connectivity speed may impact whether or not the browser pre-fetches. The point is, you don't have to worry about this. You'll only be using what you need.
Just make sure you're compressing your media as efficiently as you can for the required audio quality. Use a good codec like Opus or AAC.
I was looking into hosting services as well, but figured that most of them are specialized in hosting podcasts (fewer but very large files). Has anyone an opinion about whether I could use these services to host and stream possibly 1000s of files with sizes ranging from very small to rather large?
Most any regular HTTP CDN will work just fine.
One final note for you... beware of iOS and Safari. Thanks to Apple's restrictive policies, all browsers under iOS are effectively Safari. Safari is incapable of playing more than one audio element at a time. If you use the Web Audio API you have more flexibility, but the Web Audio API has no real provision for streaming. You can use a media element source node, but this breaks lock screen metadata and outright doesn't work on some older versions of iOS. TL;DR; Safari is all but useless for audio on the web, and Apple's business practices have broken any alternatives.
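If you do route a streaming element through the Web Audio graph despite those caveats, the wiring is just a media element source node. A minimal sketch (the URL is a placeholder):

```javascript
// Sketch: stream via an audio element, but process/mix it in Web Audio.
const ctx = new AudioContext();
const el = new Audio('/audio/long-music-set.mp3'); // placeholder path
el.crossOrigin = 'anonymous'; // required if the file is served cross-origin

const source = ctx.createMediaElementSource(el);
const gain = ctx.createGain();
gain.gain.value = 0.8;

source.connect(gain);
gain.connect(ctx.destination);

el.play(); // typically must be triggered by a user gesture
```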

Sending and Receiving PCM Samples

I've been working on making a near real-time voice chat app. The webpage will send packets to the server, and the server will save the packets to disk, and then retransmit the packets to the other connected webpages. I've tried many other solutions, but they are either laggy or they do not play. I've realized that sending PCM samples would be optimal (the server will be recording these as well), but I'm not sure how to get them to play on another client's end. I'm using Node.js with Socket.IO. Thanks in advance!
The webpage will send packets to the server, and the server will save the packets to disk, and then retransmit the packets to the other connected webpages.
Already, this is not that efficient. It's better when possible to send data directly from peer to peer.
I've realized that sending PCM samples would be optimal
No, it wouldn't. This requires more bandwidth, which is going to need better buffering, which means higher latency. This is voice chat... no need to use a lossless encoding like PCM.
I've been working on making a near real-time voice chat app.
This is basically the de facto primary use case that WebRTC was built for. If you use WebRTC, you get:
Peer-to-peer streaming (where possible)
NAT traversal (to enable those P2P connections, where possible, or proxy them when not)
Low latency optimization, from end to end
Hardware acceleration (where available)
Opus audio codec
Automatic resampling, for compatibility and to keep latency low as things drop
In other words, this is already a solved problem with WebRTC.
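To make that concrete, a rough sketch of one peer's side, assuming Socket.IO is used only for signaling (the event names and STUN host are placeholders):

```javascript
// Sketch: microphone in, Opus over a peer connection, remote audio out.
async function startVoiceChat(socket, isCaller) {
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: 'stun:stun.example.com:3478' }] // placeholder STUN server
  });

  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  mic.getTracks().forEach((track) => pc.addTrack(track, mic));

  // Play whatever audio the remote peer sends back.
  pc.ontrack = (e) => {
    const audio = new Audio();
    audio.srcObject = e.streams[0];
    audio.play();
  };

  // Signaling: relay SDP and ICE candidates through the Socket.IO server.
  pc.onicecandidate = (e) => e.candidate && socket.emit('signal', { candidate: e.candidate });
  socket.on('signal', async (msg) => {
    if (msg.sdp) {
      await pc.setRemoteDescription(msg.sdp);
      if (msg.sdp.type === 'offer') {
        await pc.setLocalDescription(await pc.createAnswer());
        socket.emit('signal', { sdp: pc.localDescription });
      }
    } else if (msg.candidate) {
      await pc.addIceCandidate(msg.candidate);
    }
  });

  // Only the calling side kicks things off with an offer.
  if (isCaller) {
    await pc.setLocalDescription(await pc.createOffer());
    socket.emit('signal', { sdp: pc.localDescription });
  }

  return pc;
}
```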

Web Audio Streaming/Seeking (Mobile + Desktop) with Tracking

It's been a long time since I've needed to build a streaming media solution for the web since there are so many good services providing basic needs. Ages ago I had done streaming deployments with Red5 and a frontend player like Flow or JW, but despite a lot of searching I haven't been able to determine what the best modern options are to achieve a simple web solution that has good native compatibility with mobile and desktop. I'm hoping someone can suggest a good open source stack that would be a good fit for what I'm trying to do:
Mobile + Desktop Web Audio Support (with the possibility of some video at some point)
Streaming (not live on-air streaming, but the ability to seek and listen without downloading the whole file, buffering only enough for good quality so as not to waste bandwidth)
Track how much of a stream has been listened to (not the amount downloaded)
Protecting streams (this should be pretty easy, either using headers or URL patterns that expire the link/connection after a certain amount of time, with that logic written on the server side, but I wanted to mention it anyway)
Lightweight, maybe 10 to 20 concurrent audio streams for now.
My initial thought, since my knowledge in the streaming space is dated, was to do it with RTMP, RTSP or HLS and something like JWPlayer for the frontend, depending on which protocol seemed to have the best support across the most devices. Or, if the client-side support is good, I could probably skip the media server and go with pseudo-streaming via the web server. On the server side I could handle the access control to the streams, and on the front-end player I could use the API to keep track of the number of seconds played and just ping that data in an AJAX request to keep track of how much has actually been listened to (as opposed to downloaded/buffered).
The thing is I have a feeling that is probably an obsolete way to go about it in 2017. Even though I haven't turned up anything specific yet, I had a feeling perhaps there was a better solution out there utilizing Node and perhaps websockets to accomplish the same thing with a server and player coupled more tightly (server aware of playhead/actual amount played vs front end only) or a solution that leverages newer standards (MPEG-DASH?). Plus, in the past getting the other streaming protocols to be compatible across all devices was a pain and IIRC there is no universal HTML5 support for any given one of the older protocols across all browsers.
So what are the best current approaches to tackle something like this? Is a media server + a front end like JW or Flow still one of the better ways to go or are there better/easier ways to deploy a solution like this?
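For the play-tracking piece specifically, one rough sketch using nothing but the media element API (the reporting endpoint and interval are placeholders, not an established pattern):

```javascript
// Sketch: count whole seconds that have actually played (not just buffered)
// and report them periodically.
const audio = document.querySelector('audio');
const playedSeconds = new Set();

audio.addEventListener('timeupdate', () => {
  if (!audio.paused && !audio.seeking) {
    playedSeconds.add(Math.floor(audio.currentTime));
  }
});

setInterval(() => {
  if (playedSeconds.size === 0) return;
  fetch('/api/listen-progress', { // placeholder endpoint
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ trackId: audio.dataset.trackId, seconds: playedSeconds.size })
  });
}, 10000);
```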

Live video streaming webrtc?

So I want to be able to let one client distribute a live video feed to a lot of other clients who are only watching. Is it possible to use WebRTC for this? Or will I basically have to go through a service like Ustream or whatever?
It is possible, with a caveat. One client can establish many outgoing P2P connections, one to every viewer, and stream video to them directly. However, for more than a very small handful of viewers this will quickly saturate the origin's bandwidth and possibly its CPU. You would not be able to serve many viewers this way; however, this would work entirely without a middleman.*
* Save for the WebRTC connection negotiation (signaling) server.
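A sketch of that per-viewer approach, where the same captured stream is attached to a separate peer connection for each viewer (the signaling helper and STUN host are placeholders):

```javascript
// Sketch: one RTCPeerConnection per viewer, all sharing the same source stream.
async function startBroadcast(signaling) {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });
  const viewers = new Map(); // viewerId -> RTCPeerConnection

  signaling.onViewerJoined(async (viewerId) => {
    const pc = new RTCPeerConnection({
      iceServers: [{ urls: 'stun:stun.example.com:3478' }] // placeholder
    });
    stream.getTracks().forEach((track) => pc.addTrack(track, stream));

    pc.onicecandidate = (e) => e.candidate && signaling.send(viewerId, { candidate: e.candidate });
    await pc.setLocalDescription(await pc.createOffer());
    signaling.send(viewerId, { sdp: pc.localDescription });
    // ...handle the viewer's answer and candidates via signaling as usual.

    viewers.set(viewerId, pc); // bandwidth/CPU cost grows with every entry here
  });
}
```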
To be able to serve a larger number of viewers you'll have to use a centralised distribution server. The source would send exactly one video stream to that server, and the server would stream it to anyone who's interested. This still requires that server to have a beefy CPU and a lot of bandwidth; but this is more realistic to scale than the client side.
You may need to spend a lot of money on such a server; have a look at c3.xlarge and better instances at AWS to get an idea. Using an established, cheap infrastructure like ustream may indeed be the more realistic option.

Getting WebRTC performance indication

We've developed a professional WebRTC application and are trying to give users an indication of how many streams their PC can handle (2-7). Is there an easy way of figuring this out (in browser or with a separate application)?
It's a conference application we offer to users browsing with Chrome.
Another question, if you work with for example 7 streams, are they divided over the different CPU cores? Or is the whole WebRTC deal included in the process for that browser tab?
WebRTC makes extensive use of threads, so it can utilize more than one core especially in multi-party conferences.
The simplest way to check is to make calls to yourself (each one = 2 calls in a mesh conference). If it's an MCU-style conference (likely with 7 participants), you need to simulate a one-way call (so you're doing one encode), plus decode N additional VP8 streams at "appropriate" resolutions.
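A rough sketch of such a "call to yourself": two peer connections wired together in the same page, so no signaling server is involved. Repeat it several times and watch CPU usage and getStats() to gauge capacity:

```javascript
// Sketch: a self-contained loopback call (pc1 sends, pc2 receives).
async function loopbackCall(stream) {
  const pc1 = new RTCPeerConnection();
  const pc2 = new RTCPeerConnection();

  // Exchange ICE candidates directly instead of via a signaling server.
  pc1.onicecandidate = (e) => e.candidate && pc2.addIceCandidate(e.candidate);
  pc2.onicecandidate = (e) => e.candidate && pc1.addIceCandidate(e.candidate);

  stream.getTracks().forEach((track) => pc1.addTrack(track, stream));

  await pc1.setLocalDescription(await pc1.createOffer());
  await pc2.setRemoteDescription(pc1.localDescription);
  await pc2.setLocalDescription(await pc2.createAnswer());
  await pc1.setRemoteDescription(pc2.localDescription);

  return [pc1, pc2];
}

// Usage: open N simultaneous loopback calls and observe CPU / frame rates.
// const cam = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });
// const calls = await Promise.all([...Array(4)].map(() => loopbackCall(cam)));
```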
Such measurements are complicated by Firefox, for example, using content analysis to selectively reduce the resolution and/or frame rate of sent video depending on load and outgoing bandwidth. For your case, though, the load is more one of reception (decoding).
The short answer, though, is that it's hard to be sure, and will depend on the other senders too.