Interactive streaming: it’s a party, and everyone is invited

Clubhouse has been the center of attention since the beginning of 2021, bringing a refreshingly new interactive experience to the world of live streaming.

We want human contact, or at the very least the digital equivalent of human contact: interactive exchanges.

It’s not that interaction is impossible; many tools out there offer interactive video. But most of them are professional workplace tools built for collaboration and efficiency (Zoom, Google Meet, Microsoft Teams), not designed to reproduce the impromptu exchanges at the coffee machine or water cooler.

The real genius of Clubhouse is the architecture that promotes engagement by design, and through specific layers:

  1. Stage (a few fully interactive individuals)
  2. Audience (close to the stage, able to ask for the mic to participate live)
  3. Following (less engaged, but still interested)
  4. Live and Now

We see the number of people involved scale up as the interactivity scales down, without ever locking users into a single layer. Stand up and go on stage, or drop the mic and sit down. Brilliant!

Clubhouse and other recently launched competing technologies are audio-only. Obviously, the next step is to add video to provide an even more engaging user experience.

Most of the traditional social apps followed this evolution path (WhatsApp, Viber, Facebook Messenger). Time is of the essence, as it is likely that most of the big social media apps that already have video capabilities are now looking to implement Clubhouse’s engagement model.

Interactive streaming has been a recurring request from Millicast customers this year, and we have spent the last several months building a solution that is both elegant and fits nicely into our proven, reliable, fast, and scalable architecture.

We are happy to announce that Millicast Interactivity is now a reality.

This is how we have achieved it:

Multisource Streams

When you publish a live feed to Millicast (either by WebRTC or RTMP) using the current API, any previously published live feed is overwritten seamlessly, so viewers will always watch the most recently published stream.

This is done in order to allow reconnections or to replace a malfunctioning encoder before the previously published live feed is disconnected due to a timeout.

In other words, until now Millicast allowed only a single audio and video feed to be active for any given stream.

With the new multisource feature, Millicast is now able to bundle different independent publication feeds (each one identified by a different source id) under the same stream, which makes multiple audio and video tracks from different sources available to viewers.

The multisource feature is currently supported only with WebRTC sources. Each WebRTC feed is published as usual, simply by adding a sourceId attribute when sending the publish command to Millicast. If no sourceId is present, the feed is treated as the default one to ensure backward compatibility. That means you can use an RTMP publication as the main stream alongside multiple WebRTC publications with different source ids.
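As an illustrative sketch, a publish command for an extra source could be assembled like this. Only the sourceId attribute comes from the description above; the envelope and other field names are assumptions, so check the signaling documentation for the real wire format.

```javascript
// Hypothetical builder for the JSON command sent over the Millicast
// signaling WebSocket. Only "sourceId" is taken from the text above;
// the surrounding envelope fields are assumptions.
function buildPublishCommand(sdp, sourceId) {
  const data = { sdp };
  // No sourceId means this feed is treated as the default (main) source,
  // which keeps single-source streams working unchanged.
  if (sourceId !== undefined) {
    data.sourceId = sourceId;
  }
  return { type: 'cmd', name: 'publish', data };
}
```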

The reconnection feature is still supported within the same source, so if you publish a feed with the same source id as an existing one, the latest media source will be sent to viewers.

Audio Multiplexing

Once you have multiple sources being published as described above, the next step is to receive the sources within the viewer for playback.

By default, the Millicast servers will always negotiate a stream with an audio and video track corresponding to the main source (i.e. the one without a source id) in order to keep backward compatibility with the streams not using multisource.

You can specify the main source stream by using the pinnedSourceId attribute in the view command.

In order to be able to dynamically receive the other sources within a stream, Millicast has implemented a new feature called audio multiplexing. This feature allows the viewer to set up a number of audio tracks when starting the stream view and receive the active sources in them.

The Millicast Viewer node tracks the voice activity of each incoming audio source on behalf of the viewer, decides which are the most active, and forwards their audio data in the tracks that the viewer has already opened. Each time the source multiplexed into a track changes, a “vad” event is sent to the viewer, so the application knows exactly which source is being multiplexed into each track at any given time.

As many applications will both publish an audio source and view the same stream, the view command can include a list of excluded source ids that will not be tracked or multiplexed for that viewer.
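A hedged sketch of assembling the view command attributes described here; the attribute names (pinnedSourceId, multiplexedAudioTracks, excludedSourceIds) come from the text, while the builder itself is purely illustrative:

```javascript
// Illustrative helper collecting the view-command attributes described in
// the text; defaults mirror the backward-compatible single-source behavior.
function buildViewOptions({
  pinnedSourceId,
  multiplexedAudioTracks = 0,
  excludedSourceIds = [],
} = {}) {
  const options = {};
  if (pinnedSourceId !== undefined) {
    // Which source to deliver as the "main" audio/video track.
    options.pinnedSourceId = pinnedSourceId;
  }
  if (multiplexedAudioTracks > 0) {
    // How many extra audio tracks to open for multiplexing.
    options.multiplexedAudioTracks = multiplexedAudioTracks;
  }
  if (excludedSourceIds.length > 0) {
    // Typically our own published source, so we don't hear ourselves back.
    options.excludedSourceIds = excludedSourceIds;
  }
  return options;
}
```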

It is worth noting that by choosing to multiplex instead of audio mixing, Millicast is able to fully support End-to-End Encryption (E2EE) with interactive streaming since the audio never needs to be processed or re-encoded.

In order to improve the performance of the feature and avoid incurring higher bandwidth costs, it is recommended to enable dtx (discontinuous transmission) on the publishing side, so audio data is sent only when a user’s voice is detected.

Implementing a Clubhouse-like application

Now, we are going to demonstrate how it is possible to implement an application similar to Clubhouse by combining the features described in the previous section.

The Millicast platform provides you with all of the media functionality required for the application, but does not provide you with a hosted “room” or the logic specific to this Clubhouse use case.

But to make things even easier for you, we have created a demo application and server that you can download from GitHub and host yourself.

We have also hosted an online version so you can easily try it yourself: https://spaces.millicast.com

Setting up media publishing and viewing

The first step is to set up wildcard publish and subscribe tokens for all of your streams and configure them on your server.

The demo server is based on socket.io and is in charge of handling all of the room and participant logic. The server uses the wildcard tokens to make requests to the Millicast Director API and obtain the final jwt tokens for publishing and viewing a source for each participant.
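As a rough server-side sketch, exchanging a wildcard token for a short-lived jwt could look like the following. The Director API endpoint paths and body fields shown here are assumptions to verify against the API reference:

```javascript
// Sketch of the server-side exchange of a wildcard token for a short-lived
// jwt via the Millicast Director API. Endpoint paths and body fields are
// assumptions; check the current Director API documentation.
const DIRECTOR_BASE = 'https://director.millicast.com/api/director';

function directorRequest(kind, wildcardToken, streamName, streamAccountId) {
  // kind is 'publish' or 'subscribe'
  const body = { streamName };
  if (streamAccountId) body.streamAccountId = streamAccountId;
  return {
    url: `${DIRECTOR_BASE}/${kind}`,
    options: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${wildcardToken}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(body),
    },
  };
}

// Usage on the server (Node 18+ fetch):
// const { url, options } = directorRequest('subscribe', SUBSCRIBE_TOKEN, roomId, ACCOUNT_ID);
// const { data } = await (await fetch(url, options)).json(); // data.jwt is the short-lived token
```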

When a user creates a new room, the socket.io server will assign it a unique identifier that will be used as the Millicast stream name (the final stream id is your account id plus the streamName). Socket.io also assigns a unique identifier to each user, which will be the sourceId when publishing media to the stream.

Once a room is created, a user can join the room and become a participant. When a participant joins a room, the server will make a request to the Director API to get the jwt token for viewing the stream associated with the room (via the roomId):

In case the participant is also the owner of the room, the server will retrieve, in parallel, a jwt token that enables publishing as well:
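A minimal sketch of the join flow, assuming a hypothetical getJwt helper that wraps the Director API call; the event names and object shapes are illustrative, not the demo's actual code:

```javascript
// Hypothetical socket.io join handler. getJwt(kind, roomId) is an injected
// helper standing in for the Director API call; 'joined' is an assumed
// event name.
async function onJoinRoom(socket, room, getJwt) {
  const isOwner = socket.id === room.ownerId;
  // Every participant gets a viewer token; the owner's publish token is
  // fetched in parallel so they can go live immediately.
  const [viewerJwt, publisherJwt] = await Promise.all([
    getJwt('subscribe', room.id),
    isOwner ? getJwt('publish', room.id) : null,
  ]);
  socket.emit('joined', { roomId: room.id, viewerJwt, publisherJwt });
  return { viewerJwt, publisherJwt };
}
```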

Note that the jwt tokens returned by the Director API have a very short lifespan and are bound to a streamId, so they can’t be used fraudulently by a rogue participant to publish to a different stream or spam the room.

The final case where the server interacts with the Millicast Director API is when a user is promoted by the host and becomes a speaker. The application and server handle the “raise hand” request and notify the owner of the room, giving the owner the ability to promote the participant to an active speaker. This triggers the server to make another request to the Millicast Director API to retrieve a jwt publish token, which is sent to the participant within the promoted event:
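The promotion step might be sketched like this; the event names and the io/room shapes are assumptions, and getJwt again stands in for the Director API call:

```javascript
// Hypothetical sketch of the raise-hand promotion flow. The 'promoted'
// event name and the io/room shapes are assumptions.
async function promoteToSpeaker(io, room, participantId, getJwt) {
  // Only requested after the owner approves the raised hand.
  const publisherJwt = await getJwt('publish', room.id);
  room.speakers.add(participantId);
  // The token travels only to the promoted participant, inside the
  // "promoted" event, so nobody else can use it to publish.
  io.to(participantId).emit('promoted', { publisherJwt });
  return publisherJwt;
}
```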

Publishing media

Once your participant has received the jwt publishing token, it is time for the application to start publishing media to the stream.

The audio multiplexing feature can receive multiple audio tracks but only one video track. As a result, in the demo application we have restricted video publishing to the owner of the room. We will be using the Millicast JavaScript SDK:

You will need to configure a Publish object with the stream name and the publish token, then call the connect method using the socket.io room id as the streamId and the socket.io client id as the sourceId, making sure to enable dtx:
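A sketch of the publishing options described above; the SDK entry points appear only as comments because their exact names should be verified against the current SDK documentation:

```javascript
// Build the connect options from the text above: sourceId ties this feed
// to the socket.io client id, and dtx keeps audio bandwidth down while
// the user is silent.
function publishOptions(mediaStream, socketClientId) {
  return {
    mediaStream,
    sourceId: socketClientId,
    dtx: true,
  };
}

// Usage sketch in the browser app (SDK entry points are assumptions):
// import { Director, Publish } from '@millicast/sdk';
// const tokenGenerator = () => Director.getPublisher({ token: publishJwt, streamName: roomId });
// const publisher = new Publish(roomId, tokenGenerator);
// await publisher.connect(publishOptions(localStream, socket.id));
```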

You can view the full code on GitHub here:

https://github.com/millicast/millicast-spaces-app/blob/e88febc0fdfe9c2de77f5ddcbc3e4e54552b8e72/src/pages/rooms-form/rooms-form.ts#L206

Viewing media

Once the participant receives the viewer token, the application uses the Viewer object of the SDK to start receiving the media for the stream:

We will set the pinnedSourceId to the room owner’s sourceId for all participants who are not the owner. This ensures the Millicast Viewer server will always send the owner’s audio (and video) in the main stream.

Then we will configure three audio tracks for multiplexing (via the multiplexedAudioTracks parameter). By default, libwebrtc mixes only the three loudest of the received audio tracks, so this is a good trade-off for performance.

Lastly, we set our own sourceId in the excludedSourceIds parameter so the server does not multiplex our own audio back to us.
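Putting those three settings together, the demo's connect options might be built like this; the SDK entry points in the trailing comment are assumptions to check against the SDK documentation:

```javascript
// Concrete view options for the demo, following the text above.
function roomViewOptions(myId, ownerId) {
  const options = {
    multiplexedAudioTracks: 3, // libwebrtc mixes at most three audio levels
    excludedSourceIds: [myId], // don't multiplex our own audio back to us
  };
  if (myId !== ownerId) {
    options.pinnedSourceId = ownerId; // owner's audio/video in the main stream
  }
  return options;
}

// Usage sketch (SDK entry points are assumptions):
// import { Director, View } from '@millicast/sdk';
// const tokenGenerator = () => Director.getSubscriber({ token: viewerJwt, streamName: roomId });
// const viewer = new View(roomId, tokenGenerator);
// await viewer.connect(roomViewOptions(socket.id, room.ownerId));
```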

Once the connection is established, you will receive one event for each remote track in the main stream and one for each multiplexed track. You can differentiate between them either by the number of audio tracks in the stream (all multiplexed audio tracks are associated with the same stream) or by the mid order of the associated transceivers (the first ones of each belong to the main stream).

You can view the full code on GitHub here:

https://github.com/millicast/millicast-spaces-app/blob/e88febc0fdfe9c2de77f5ddcbc3e4e54552b8e72/src/pages/rooms-form/rooms-form.ts#L273

Detecting who is talking

Once you have everything set up, your application may want to display which participant is speaking and show a voice level indicator.

To help the application with that, the viewer server sends several events over the websocket connection used for the view command; these are exposed to the application by the JavaScript SDK:

https://github.com/millicast/millicast-spaces-app/blob/e88febc0fdfe9c2de77f5ddcbc3e4e54552b8e72/src/pages/rooms-form/rooms-form.ts#L345

One event is emitted when a new source is published to, or removed from, the stream. Another, the “vad” event, indicates which source id is being multiplexed into each audio track based on the voice activity level.

Those events allow you to immediately detect which participant is received in each track, so you just need to use the WebRTC PeerConnection stats to retrieve the audio level for that track:

https://github.com/millicast/millicast-spaces-app/blob/e88febc0fdfe9c2de77f5ddcbc3e4e54552b8e72/src/pages/rooms-form/rooms-form.ts#L420
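A sketch of the bookkeeping these events enable; the payload shape (event.name, event.data.mediaId, event.data.sourceId) and the "active"/"inactive" names are assumptions to check against the SDK documentation:

```javascript
// Track which sources are live and which source currently occupies each
// multiplexed audio track, based on the broadcast events described above.
const activeSources = new Set();
const trackSource = new Map(); // track mid -> sourceId currently multiplexed

function onBroadcastEvent(event) {
  switch (event.name) {
    case 'active':
      activeSources.add(event.data.sourceId);
      break;
    case 'inactive':
      activeSources.delete(event.data.sourceId);
      break;
    case 'vad':
      // The server switched which source this audio track carries.
      trackSource.set(event.data.mediaId, event.data.sourceId);
      break;
  }
}

// Usage in the browser to drive a speaking indicator:
// const stats = await peerConnection.getStats();
// stats.forEach((report) => {
//   if (report.type === 'inbound-rtp' && report.kind === 'audio') {
//     // report.audioLevel (0..1) is the level for the track whose
//     // current source we looked up in trackSource
//   }
// });
```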

What’s next?

Multisource and audio multiplexing are just the first two features that Millicast has made available to allow our customers to add interactivity to their applications.

But this is only the beginning.

We are currently working on a projection feature that will let the viewer dynamically choose which of the sources in a stream to receive (with both audio and video!), and a secondary stream view feature that will let you watch two different streams simultaneously.

The first will enable Millicast to be used as a massively distributed SFU to implement any type of meeting-like application (build your own Zoom, Meet, or Teams).

The second one will allow you to implement watch parties, where the first stream is the multiconference “party”, and the second stream is the streamed content itself.

What are you waiting for?

To join this Millicast Interactive Beta, you need to:

We will enable the multisource capabilities on your Millicast account so you can start building your own real-time interactive streaming applications and take over the world.

The Fastest Streaming on Earth. Realtime WebRTC CDN built for large-scale video broadcasting on any device with sub-500ms latency.