Basic architecture to serve, stream and consume large audio files to minimize client-side resource consumption and latency

I am trying to build a web application which will need to have audio streaming functionality implemented in some way. Just to give you guys some context: it is designed to be a purely auditory experience/game/idkhowtocallit with lots of different sound assets varying in length and thus file size. The sound assets will consist of ambient sounds, spoken bits of conversation, but also long music sets (up to a couple of hours). The reason I think I won't be able to just host these audio files on some server or CDN and serve them from there is that the sound assets will need to be fetched and played dynamically (depending on user interaction) and as instantly as possible.
Most importantly, consuming larger files (like the music sets and long ambient loops) as a whole doesn't seem to be client-friendly at all to me (data consumption on mobile networks and client-side memory usage).
Also, without any buffering or streaming mechanism, the client won't be able to start playing these files before they are downloaded completely, right? Which would add the issue of high latencies.
I've tried to do some online research on how to properly implement a good infrastructure to stream bigger audio files to clients on the server side and found HLS and MPEG-DASH. I have some experience with consuming HLS streams with web players and, if I understand it correctly, I would use some sort of one-time transformation process (on or after file upload) to split up the files into chunks and create the playlist, and then just serve these files via HTTP. From what I understand, the process should be more or less the same for MPEG-DASH. My issue with these two techniques is that I couldn't really find any documentation on how to implement JavaScript/TypeScript clients (particularly using the Web Audio API) without reinventing the wheel. My best guess would be to use something like hls.js, bind the HLS streams to freshly created audio elements, and use these elements to create audio sources in my Web Audio graph. How far off am I? I'm trying to get at least an idea of a best practice.
To sum up what I would really appreciate to get some clarity about:
Would HLS or MPEG-DASH really be the way to go or am I missing a more basic chunked file streaming mechanism with good libraries?
How - theoretically - would I go about limiting the amount of chunks downloaded in advance on the client side to save client-side resources, which is one of my biggest concerns?
I was looking into hosting services as well, but figured that most of them are specialized in hosting podcasts (fewer but very large files). Has anyone an opinion about whether I could use these services to host and stream possibly 1000s of files with sizes ranging from very small to rather large?
Thank you so much in advance to everyone who will be bothered with helping me out. Really appreciate it.

> Why I think I won't be able to just host these audio files on some server or CDN and serve them from there is, because the sound assets will need to be fetched and played dynamically (depending on user interaction) and as instantly as possible.
Your long-running ambient sounds can stream, using a normal HTMLAudioElement. When you play them, there may be a little lag before they start, since they have to begin streaming, but note that the browser will generally prefetch the metadata and maybe even the beginning of the media data.
For short sounds where latency is critical (like one-shot user interaction sound effects), load those into buffers with the Web Audio API for playback. You won't be able to stream them, but they'll play as instantly as you can get.
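As a rough sketch of the buffer approach for short sounds, assuming a browser AudioContext is passed in: `createSoundBank`, `load` and `play` are hypothetical names used for illustration only, and the fetch function is injectable just so the helper can be exercised outside a browser.

```javascript
// Hypothetical helper, not part of any library.
// `audioContext` is expected to be a Web Audio AudioContext.
function createSoundBank(audioContext, fetchImpl = globalThis.fetch) {
  const buffers = new Map();

  return {
    // Download and decode a short sound once, ahead of time.
    async load(name, url) {
      const response = await fetchImpl(url);
      if (!response.ok) {
        throw new Error(`Failed to load ${url}: ${response.status}`);
      }
      const encoded = await response.arrayBuffer();
      buffers.set(name, await audioContext.decodeAudioData(encoded));
    },

    // Play a preloaded sound. Buffer source nodes are one-shot,
    // so a fresh one is created for every playback.
    play(name) {
      const buffer = buffers.get(name);
      if (!buffer) {
        throw new Error(`Sound "${name}" has not been loaded`);
      }
      const source = audioContext.createBufferSource();
      source.buffer = buffer;
      source.connect(audioContext.destination);
      source.start();
      return source;
    },
  };
}
```

In a page this might be used as `const bank = createSoundBank(new AudioContext()); await bank.load('click', '/sounds/click.opus'); bank.play('click');`, where the URL is a placeholder. Browsers generally require a user gesture before an AudioContext will produce sound, so the first `play` is best triggered from a click or tap handler.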
> Most importantly, consuming larger files (like the music sets and long ambient loops) as a whole doesn't seem to be client-friendly at all to me (used data consumption on mobile networks and client-side memory usage).
If you want to play the audio, you naturally have to download that audio. You can't play something you haven't loaded in some way. If you use an audio element, you won't be downloading much more than what is being played. And, that downloading is mostly going to occur on-demand.
> Also, without any buffering or streaming mechanism, the client won't be able to start playing these files before they are downloaded completely, right? Which would add the issue of high latencies.
If you use an audio element, the browser takes care of all the buffering and whatnot for you. You don't have to worry about it.
> I've tried to do some online research on how to properly implement a good infrastructure to stream bigger audio files to clients on the server side and found HLS and MPEG-DASH.
If you're only streaming a single bitrate (which for audio is usually fine) and you're not streaming live content, then there's no point to HLS or DASH here.
> Would HLS or MPEG-DASH really be the way to go or am I missing a more basic chunked file streaming mechanism with good libraries?
The browser will make ranged HTTP requests to get the data it needs out of the regular static media file. You don't need to do anything special to stream it. Just make sure your server is configured to handle ranged requests... most any should be able to do this right out of the box.
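To check that a server handles ranged requests the way an audio element expects, it may help to see what a single-range exchange looks like. The following is a simplified sketch, not a complete implementation of HTTP range requests: `resolveRange` is a hypothetical helper that turns a Range request header and a file size into the status and headers a server would typically send back.

```javascript
// Hypothetical helper for illustration; real servers and CDNs do this for you.
// Only single ranges are handled; anything else falls back to the full file,
// which a server is allowed to do.
function resolveRange(rangeHeader, fileSize) {
  const match = /^bytes=(\d*)-(\d*)$/.exec(rangeHeader || '');
  if (!match || (match[1] === '' && match[2] === '')) {
    return {
      status: 200,
      start: 0,
      end: fileSize - 1,
      headers: { 'Accept-Ranges': 'bytes', 'Content-Length': String(fileSize) },
    };
  }

  let start;
  let end;
  if (match[1] === '') {
    // Suffix form "bytes=-N": the last N bytes of the file.
    start = Math.max(fileSize - Number(match[2]), 0);
    end = fileSize - 1;
  } else {
    // "bytes=A-B" or open-ended "bytes=A-"; clamp the end to the file size.
    start = Number(match[1]);
    end = match[2] === '' ? fileSize - 1 : Math.min(Number(match[2]), fileSize - 1);
  }

  if (start > end || start >= fileSize) {
    // Range Not Satisfiable
    return { status: 416, headers: { 'Content-Range': `bytes */${fileSize}` } };
  }

  return {
    status: 206, // Partial Content
    start,
    end,
    headers: {
      'Accept-Ranges': 'bytes',
      'Content-Range': `bytes ${start}-${end}/${fileSize}`,
      'Content-Length': String(end - start + 1),
    },
  };
}
```

A quick way to confirm a host behaves like this is to request a file with a `Range: bytes=0-99` header and look for a 206 status with a `Content-Range` header in the response.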
> How - theoretically - would I go about limiting the amount of chunks downloaded in advance on the client side to save client-side resources, which is one of my biggest concerns?
The browser does this for you if you use an audio element. Additionally, data saving settings and the detected connectivity speed may impact whether or not the browser pre-fetches. The point is, you don't have to worry about this. You'll only be using what you need.
Just make sure you're compressing your media as efficiently as you can for the required audio quality. Use a good codec like Opus or AAC.
> I was looking into hosting services as well, but figured that most of them are specialized in hosting podcasts (fewer but very large files). Has anyone an opinion about whether I could use these services to host and stream possibly 1000s of files with sizes ranging from very small to rather large?
Most any regular HTTP CDN will work just fine.
One final note for you... beware of iOS and Safari. Thanks to Apple's restrictive policies, all browsers under iOS are effectively Safari. Safari is incapable of playing more than one audio element at a time. If you use the Web Audio API you have more flexibility, but the Web Audio API has no real provision for streaming. You can use a media element source node, but this breaks lock screen metadata and outright doesn't work on some older versions of iOS. TL;DR: Safari is all but useless for audio on the web, and Apple's business practices have broken any alternatives.

Related

Is there a way to offer multiple video qualities (resolutions) without uploading multiple videos in HTML5 video player?

I'm trying to add a few videos to my website using HTML5. My videos are all 1080, but I want to give people the option to watch in a lower quality if needed. Can I do this without having to upload multiple videos (1 for each quality) without the usage of a server-side language?
I've been searching extensively for this. I haven't found anyone saying that it can't be done, but no one has said it can either. I am using Blogger as my host, which is why I can't use server-side languages.
Thank you.
> without the usage of a server-side language?
Yes, of course. The client can choose what version of the video to download.
> Can I do this without having to upload multiple videos (1 for each quality)
Not practically, no. You need to transcode that video and upload those different versions.
> Haven't find anyone say that it can't be done
A couple things to consider... first is that a video file can contain many streams. I don't know what your aversion is to multiple files, but yes it is possible to have several bitrates of video in a single container. A single MP4, for example, could easily contain a 768 kbps video, a 2 Mbps video, and an 8 Mbps video, while having a single 256 kbps audio track.
To play such a file, a client (implemented with Media Source Extensions and the Fetch API) would need to know how to parse the container and make ranged requests for specific chunks out of the file. To my knowledge, no such client exists as there's little point to it when you can simply use DASH and/or HLS. The browser certainly doesn't do this work for you.
Some video codecs, like H.264 (through its Scalable Video Coding extension), support the concept of scalability. The idea here is that rather than having multiple encodings, there's just one, where additional data enhances the previous video that was sent. There is significant overhead with this mechanism, and even more work you'd have to do. Not only does your code now need to understand the container, but now it has to handle the codec in use as well... and it needs to do it efficiently.
To summarize, is it possible to use one file? Technically, yes. Is there any benefit? None. Is there anything off-the-shelf for this? No.
Edit: I see now your comment that the issue is one of storage space. You should really put that information in your question so you can get a useful answer.
It's common for YouTube and others to transcode things ahead of time. This is particularly useful for videos that get a ton of traffic, as the segments can be stored on the CDN, with nodes closer to the clients. Yes, it's also possible to transcode on-demand as well. You need fast hardware for this.
No.
I can't fathom how this could ever be possible. Do you have an angle in mind?
Clients can either download all or part(s) of a file. But to do this you would have to somehow download only select pixels of each frame. Even if you had knowledge of which byte-ranges of each frame were which pixels, the overhead involved in requesting each byte-range would be greater than the size of the full 1080p video.
What is your aversion to hosting multiple qualities? Is it about storage space, or complexity/time of conversion?

Current best practice to stream live video in web browser?

We develop an IP camera product which streams H.264/MPEG4/MJPEG video via RTSP/UDP. It has a web interface; currently we use the VLC Firefox plugin to allow viewing of the live RTSP stream in the browser, but Firefox is dropping support for NPAPI plugins, so that's currently a dead end.
The camera itself is a relatively low-powered ARM SoC (think Raspberry Pi level) so we don't have vast spare resource to do things like transcode streams on-the-fly on the board.
The main purpose is to check the video stream is working correctly from the web interface, so streaming a new stream (or transcoding it) in some other format/transport/streaming engine is less desirable than being able to somehow play the original RTSP stream directly. In regular use the video is streamed via RTSP into a VMS server so that's not up for alteration.
In an ideal world the solution would be open-source cross-browser and happen inside an HTML5 tag, but if it works in one or more of the most popular browsers we'll take it.
I've been reading all sorts of stuff here and around the web about the brave new world of the HTML5 video tag, WebRTC, HLS, etc. and have yet to see anything that looks like a sensible and complete solution that doesn't involve some extra conversion/transcoding/re-streaming, often by some half-supported framework or an extra server in the middle which is not a viable solution.
I haven't yet found a proper description of what may or may not be required to "convert" our stream to whatever-html5-video-likes, whether it's just a slightly different wrapper around the same basic video stream or if there's a lot of overhead and everything is different. Likewise it's not clear if the conversion could be achieved either on-board or perhaps even in-browser using JS.
The reason for the title is that if we've got to change the way it all works we may as well aim to do whatever is considered "best practice" and reasonably future-proof as far as possible rather than some expedient fudge that might not work beyond the next round of browser updates / the next W3C press release...
I find it slightly disappointing (but perhaps not surprising) that in 2017 there seems to be no sensible way of achieving this.
Perhaps "least worst practice" would be more suitable terminology...
There are many methods you can use that don't require transcoding.
WebRTC
If you're using RTSP, you're much of the way there in sending your streams via WebRTC.
WebRTC uses SDP for declaring streams, and RTP for the transport of these streams. There are some other layers you need for setting up the WebRTC call, but none of these require particularly expensive computation. Most (all?) WebRTC clients will support H.264 decoding, many with hardware acceleration in-browser.
The easiest way to get started with WebRTC is to implement a browser-to-browser client first. Then, you can go a layer deeper with your own implementation.
WebRTC is the route I recommend to you. NAT traversal (in most cases) and P2P connectivity are built-in, so your customers won't have to remember IP addresses. Simply provide signalling services and your customers can connect directly to their cameras at home from wherever. Provide TURN servers, and they'll be able to connect even if both ends are firewalled. If you don't wish to provide such services, they're lightweight and can run directly on the camera in a mode like you have today.
Fragmented MP4 over HTTP Progressive with <video> tag
This method is much simpler than WebRTC, but totally different than what you're doing now. You can take your H.264 stream, and wrap it directly in an MP4 without transcoding. Then, it can be played in a <video> tag on a page. You'll have to implement the appropriate libs in your code, but here's an FFmpeg example that outputs to STDOUT, which you'd pipe to clients:
ffmpeg \
-i YOUR_CAMERA_HERE \
-vcodec copy \
-acodec copy \
-f mp4 \
-movflags frag_keyframe+empty_moov \
-
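If the camera's web interface happened to run on Node.js (an assumption; the question does not say what it runs on), piping that FFmpeg output to clients could look roughly like this. `ffmpegArgs` and `createStreamServer` are hypothetical names, and in this sketch each viewer gets its own FFmpeg process, which is only reasonable for a handful of viewers checking that the stream works.

```javascript
const { spawn } = require('node:child_process');
const http = require('node:http');

// Same flags as the command above: copy the streams, emit fragmented MP4 to stdout.
function ffmpegArgs(input) {
  return [
    '-i', input,
    '-vcodec', 'copy',
    '-acodec', 'copy',
    '-f', 'mp4',
    '-movflags', 'frag_keyframe+empty_moov',
    '-',
  ];
}

// Hypothetical server: every HTTP request starts FFmpeg and streams its output.
// `spawnImpl` is injectable only to make the sketch easier to exercise.
function createStreamServer(input, spawnImpl = spawn) {
  return http.createServer((req, res) => {
    const ffmpeg = spawnImpl('ffmpeg', ffmpegArgs(input), {
      stdio: ['ignore', 'pipe', 'ignore'],
    });
    // If FFmpeg cannot be started, drop the connection instead of crashing.
    ffmpeg.on('error', () => res.destroy());
    res.writeHead(200, { 'Content-Type': 'video/mp4' });
    ffmpeg.stdout.pipe(res);
    // Stop FFmpeg when the viewer goes away.
    res.on('close', () => ffmpeg.kill());
  });
}
```

Usage might be `createStreamServer('rtsp://camera.example/stream').listen(8080)` with a `<video src="http://camera-host:8080/">` tag on the page; both addresses are placeholders.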
Others...
In your case, there's no added benefit to DASH. DASH is intended for utilizing file-based CDNs for streaming. You control the server, so there's no point in writing out files or handling HTTP requests in a file-like manner. While you can certainly use DASH with H.264 streams without transcoding, I think it's a waste of your time.
HLS is much the same. Your stream is compatible with HLS, but HLS is dropping out of favor rapidly due to its lack of flexibility on codec. DASH and HLS are essentially the same mechanism... write a bunch of media segments to a CDN and create a playlist or manifest indicating where they are.
Well, I had to do the same thing a while back on a Raspberry Pi 3. We transcoded the stream on the fly using ffmpeg on the Pi and used https://github.com/phoboslab/jsmpeg to stream MJPEG, then played it in the browser/Ionic app.
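// Render into the canvas element with id "video-canvas" on the page.
const canvas = document.getElementById('video-canvas');
// The first argument is the stream URL (here taken from this.button.url).
this.player = new JSMpeg.Player(this.button.url, { canvas: canvas });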
We were managing up to 4 concurrent streams with minimal delay (roughly 2-5 seconds at most) on our Pis.
But once we moved to React Native, we used the RN VLC wrapper on the phones.

Web Audio Streaming/Seeking (Mobile + Desktop) with Tracking

It's been a long time since I've needed to build a streaming media solution for the web since there are so many good services providing basic needs. Ages ago I had done streaming deployments with Red5 and a frontend player like Flow or JW, but despite a lot of searching I haven't been able to determine what the best modern options are to achieve a simple web solution that has good native compatibility with mobile and desktop. I'm hoping someone can suggest a good open source stack that would be a good fit for what I'm trying to do:
Mobile + Desktop Web Audio Support (with the possibility of some video at some point)
Streaming (not live on-air streaming, but the ability to seek and listen without downloading the whole file, buffering only enough for good quality so as not to waste bandwidth)
Track how much of a stream has been listened to (not the amount downloaded)
Protecting streams (this should be pretty easy either using headers or URL patterns that expire the link/connection after a certain point of time by writing that logic into the server side but I wanted to mention it anyways)
Lightweight, maybe 10 to 20 concurrent audio streams for now.
My initial thought, since my knowledge in the streaming space is dated, was to do it with RTMP, RTSP or HLS and something like JWPlayer for the frontend, depending on which protocol seemed to have the best support across the most devices. Or, if the client-side support is good, I could probably skip the media server and go with pseudo-streaming via the webserver. On the server side I could handle the access control to the streams, and on the frontend player I could use the API to keep track of the number of seconds played and just ping that data in an AJAX request to keep track of how much has actually been listened to (as opposed to downloaded/buffered).
The thing is I have a feeling that is probably an obsolete way to go about it in 2017. Even though I haven't turned up anything specific yet, I had a feeling perhaps there was a better solution out there utilizing Node and perhaps websockets to accomplish the same thing with a server and player coupled more tightly (server aware of playhead/actual amount played vs front end only) or a solution that leverages newer standards (MPEG-DASH?). Plus, in the past getting the other streaming protocols to be compatible across all devices was a pain and IIRC there is no universal HTML5 support for any given one of the older protocols across all browsers.
So what are the best current approaches to tackle something like this? Is a media server + a front end like JW or Flow still one of the better ways to go or are there better/easier ways to deploy a solution like this?

Low latency (< 2s) live video streaming HTML5 solutions?

With Chrome disabling Flash by default very soon I need to start looking into flash/rtmp html5 replacement solutions.
Currently with Flash + RTMP I have a live video stream with < 1-2 second delay.
I've experimented with MPEG-DASH which seems to be the new industry standard for streaming but that came up short with 5 second delay being the best I could squeeze from it.
For context, I am trying to allow users to control physical objects they can see on the stream, so anything above a couple of seconds of delay leads to a frustrating experience.
Are there any other techniques, or are there really no low latency HTML5 solutions for live streaming yet?
Technologies and Requirements
The only web-based technology set really geared toward low latency is WebRTC. It's built for video conferencing. Codecs are tuned for low latency over quality. Bitrates are usually variable, opting for a stable connection over quality.
However, you don't necessarily need this low latency optimization for all of your users. In fact, from what I can gather on your requirements, low latency for everyone will hurt the user experience. While your users in control of the robot definitely need low latency video so they can reasonably control it, the users not in control don't have this requirement and can instead opt for reliable higher quality video.
How to Set it Up
In-Control Users to Robot Connection
Users controlling the robot will load a page that utilizes some WebRTC components for connecting to the camera and control server. To facilitate WebRTC connections, you need some sort of STUN server. To get around NAT and other firewall restrictions, you may need a TURN server. Both of these are usually built into Node.js-based WebRTC frameworks.
The cam/control server will also need to connect via WebRTC. Honestly, the easiest way to do this is to make your controlling application somewhat web based. Since you're using Node.js already, check out NW.js or Electron. Both can take advantage of the WebRTC capabilities already built into Chromium, while still giving you the flexibility to do whatever you'd like with Node.js.
The in-control users and the cam/control server will make a peer-to-peer connection via WebRTC (or TURN server if required). From there, you'll want to open up a media channel as well as a data channel. The data side can be used to send your robot commands. The media channel will of course be used for the low latency video stream being sent back to the in-control users.
Again, it's important to note that the video that will be sent back will be optimized for latency, not quality. This sort of connection also ensures a fast response to your commands.
Video for Viewing Users
Users that are simply viewing the stream and not controlling the robot can use normal video distribution methods. It is actually very important for you to use an existing CDN and transcoding services, since you will have 10k-15k people watching the stream. With that many users, you're probably going to want your video in a couple different codecs, and certainly a whole array of bitrates. Distribution with DASH or HLS is easiest to work with at the moment, and frees you of Flash requirements.
You will probably also want to send your stream to social media services. This is another reason why it's important to start with a high quality HD stream. Those services will transcode your video again, reducing quality. If you start with good quality first, you'll end up with better quality in the end.
Metadata (chat, control signals, etc.)
It isn't clear from your requirements what sort of metadata you need, but for small message-based data, you can use a web socket library, such as Socket.IO. As you scale this up to a few instances, you can use pub/sub, such as Redis, to distribute messages throughout the servers.
How you synchronize the metadata to the video depends a bit on what's in that metadata and what the synchronization requirement is, specifically. Generally speaking, you can assume that there will be a reasonable but unpredictable delay between the source video and the clients. After all, you cannot control how long they will buffer. Each device is different, each connection variable. What you can assume is that playback will begin with the first segment the client downloads. In other words, if a client starts buffering a video and begins playing it 2 seconds later, the video is 2 seconds behind from when the first request was made.
Detecting when playback actually begins client-side is possible. Since the server knows the timestamp of the video that was sent to the client, it can inform the client of its offset relative to the beginning of video playback. Since you'll probably be using DASH or HLS and you need to use MSE with AJAX to get the data anyway, you can use the response headers in the segment response to indicate the timestamp for the beginning of the segment. The client can then synchronize itself. Let me break this down step-by-step:
1. Client starts receiving metadata messages from the application server.
2. Client requests the first video segment from the CDN.
3. CDN server replies with the video segment. In the response headers, the Date: header can indicate the exact date/time for the start of the segment.
4. Client reads the response Date: header (let's say 2016-06-01 20:31:00). Client continues buffering the segments.
5. Client starts buffering/playback as normal.
6. Playback starts. Client can detect this state change on the player and knows that 00:00:00 on the video player is actually 2016-06-01 20:31:00.
7. Client displays metadata synchronized with the video, dropping any messages from previous times and buffering any for future times.
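The client-side bookkeeping in these steps can be sketched as follows. This is only an illustration under assumptions: `createMetadataSync` is a hypothetical helper, messages are assumed to carry a `timestampMs` field in Unix milliseconds, and the Date header is assumed to mark the start of the first segment as described above.

```javascript
// Hypothetical helper: maps player time to wall-clock time using the Date
// header of the first segment, then releases metadata messages when due.
function createMetadataSync(firstSegmentDateHeader) {
  const streamStartMs = Date.parse(firstSegmentDateHeader);
  if (Number.isNaN(streamStartMs)) {
    throw new Error('Could not parse the segment Date header');
  }
  let pending = [];

  return {
    // Wall-clock time that corresponds to a given player position (in seconds).
    wallClockMs(currentTimeSeconds) {
      return streamStartMs + currentTimeSeconds * 1000;
    },

    // Buffer a message, dropping anything older than the start of playback.
    push(message) {
      if (message.timestampMs >= streamStartMs) {
        pending.push(message);
      }
    },

    // Return (and remove) the messages that are due at the current position.
    due(currentTimeSeconds) {
      const nowMs = this.wallClockMs(currentTimeSeconds);
      const ready = pending.filter((m) => m.timestampMs <= nowMs);
      pending = pending.filter((m) => m.timestampMs > nowMs);
      return ready.sort((a, b) => a.timestampMs - b.timestampMs);
    },
  };
}
```

In a page, the header would come from something like `response.headers.get('Date')` on the first segment request, and `due(video.currentTime)` could be called from a `timeupdate` handler. For cross-origin CDN responses the Date header probably has to be exposed via `Access-Control-Expose-Headers` before scripts can read it.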
This should meet your needs and give you the flexibility to do whatever you need to with your video going forward.
Why not [magic-technology-here]?
When you choose low latency, you lose quality. Quality comes from available bandwidth. Bandwidth efficiency comes from being able to buffer and optimize entire sequences of images when encoding. If you wanted perfect quality (lossless for each image) you would need a ton (gigabits per viewer) of bandwidth. That's why we have these lossy codecs to begin with.
Since you don't actually need low latency for most of your viewers, it's better to optimize for quality for them.
For the 2 users out of 15,000 that do need low latency, we can optimize for low latency for them. They will get substandard video quality, but will be able to actively control a robot, which is awesome!
Always remember that the internet is a hostile place where nothing works quite as well as it should. System resources and bandwidth are constantly variable. That's actually why WebRTC auto-adjusts (as best as reasonable) to changing conditions.
Not all connections can keep up with low latency requirements. That's why every single low latency connection will experience drop-outs. The internet is packet-switched, not circuit-switched. There is no real dedicated bandwidth available.
Having a large buffer (a couple seconds) allows clients to survive momentary losses of connections. It's why CD players with anti-skip buffers were created, and sold very well. It's a far better user experience for those 15,000 users if the video works correctly. They don't have to know that they are 5-10 seconds behind the main stream, but they will definitely know if the video drops out every other second.
There are tradeoffs in every approach. I think what I have outlined here separates the concerns and gives you the best tradeoffs in each area. Please feel free to ask for clarification or ask follow-up questions in the comments.

Streaming adaptive audio on the web (low latency)

I am attempting to implement a streaming audio solution for the web. My requirements are these:
Relatively low latency (no more than 2 seconds).
Streaming in a compressed format (Ogg Vorbis/MP3) to save on bandwidth.
The stream is generated on the fly and is unique for each client.
To clarify the last point, my case does not fit the usual pattern of having a stream being generated somewhere and then broadcast to the clients using something like Shoutcast. The stream is dynamic and will adapt based on client input which I handle separately using regular http requests to the same server.
Initially I looked at streaming Vorbis/MP3 as http chunks for use with the html5 audio tag, but after some more research I found a lot of people who say that the audio tag has pretty high latency which disqualifies it for this project.
I also looked into Emscripten which would allow me to play audio using SDL2, but the prospect of decoding Vorbis and MP3 in the browser is not too appealing.
I am looking to implement the server in C++ (probably using the asynchronous facilities of boost.asio), and to have as small a codebase as possible for playback in the browser (the more the browser does implicitly the better). Can anyone recommend a solution?
P.S. I have no problem implementing streaming protocol support from scratch in C++ if there are no ready to use libraries that fit the bill.
You should look into Media Source Extensions.
Introduction: http://en.wikipedia.org/wiki/Media_Source_Extensions
Specification: https://w3c.github.io/media-source/