UPDATE: The Millicast 2.0 release in December 2021 adds interactivity features with both video and audio to enable Millicast to be used as a massively distributed SFU and implement any type of meeting-like application (build your own Zoom, Meet or Teams): https://millicast.medium.com/millicast-2-0-interactive-video-audio-api-e63c09253b74
Clubhouse has been the center of attention since the beginning of 2021, bringing a refreshingly new interactive experience to the world of live streaming.
We want human contact, or at the very least the digital equivalent of human contact: interactive exchanges.
It’s not that interaction is impossible; many tools out there offer interactive video. But most of them are professional workplace tools built for collaboration and efficiency (Zoom, Google Meet, Microsoft Teams), not designed to reproduce the impromptu exchanges at the coffee machine or water cooler.
The real genius of Clubhouse is the architecture that promotes engagement by design, and through specific layers:
- Stage (a few fully interactive individuals)
- Audience (close to the stage and can ask for the Mic to participate live)
- Following (less engaged, but still interested)
- Live and Now
We see the number of people involved scale up as the interactivity scales down, while users can still move freely between these layers. Stand up and go on stage, or drop the mic and sit down. Brilliant!
Clubhouse and other recently launched competing technologies are audio-only. Obviously, the next step is to add video to provide an even more engaging user experience.
Most of the traditional social apps followed this evolution path (WhatsApp, Viber, Facebook Messenger). Time is of the essence, as it is likely that most of the big social media apps that already have video capabilities are now looking to implement Clubhouse’s engagement model.
Interactive streaming has been a recurring request from Millicast customers this year, and we have been working for the last several months to provide a solution that is both elegant and fits nicely with our proven, reliable, fast and scalable architecture.
We are happy to announce that Millicast Interactivity is now a reality.
This is how we have achieved it:
When you publish a live feed to Millicast (either by WebRTC or RTMP) using the current API, any previously published live feed is overwritten seamlessly, so viewers will always watch the most recently published stream.
This is done in order to allow reconnections or to replace a malfunctioning encoder before the previously published live feed is disconnected due to a timeout.
In other words, Millicast only allowed a single audio and video feed to be active for any given stream.
With the new multisource feature, Millicast is now able to bundle different independent publication feeds (each one identified by a different source id) under the same stream, which makes multiple audio and video tracks from different sources available to viewers.
The multisource feature is currently only supported with WebRTC sources. The publication of each WebRTC feed is handled normally by just adding a sourceId attribute when sending the publish command to Millicast. In the case where no sourceId is present, the stream will be treated as the default one to ensure backward compatibility. That means that you can use an RTMP publication as the main stream and multiple WebRTC publications with different source ids.
The reconnection feature is still supported within the same source, so if you publish a feed with the same source id as an existing one, the latest media source will be sent to viewers.
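As a sketch of what this looks like in practice, here is how a second source could be published under an existing stream, assuming the Millicast JavaScript SDK (`@millicast/sdk`); the class and option names (`Director.getPublisher`, `Publish`, `sourceId`, `dtx`) follow its documented API, but treat the exact signatures as assumptions to verify against the current docs:

```javascript
// Pure helper: the connect options for one multisource publication.
function publishOptions(mediaStream, sourceId) {
  return {
    mediaStream,
    sourceId, // identifies this feed; omit it to publish the default source
    dtx: true, // only transmit audio while voice activity is detected
  };
}

async function publishAs(streamName, publishToken, mediaStream, sourceId) {
  // Loaded lazily so the helper above stays usable without the SDK installed.
  const { Director, Publish } = await import('@millicast/sdk');
  const tokenGenerator = () =>
    Director.getPublisher({ token: publishToken, streamName });
  const publisher = new Publish(streamName, tokenGenerator);
  await publisher.connect(publishOptions(mediaStream, sourceId));
  return publisher;
}
```

Publishing again with the same sourceId replaces that source's media for viewers, which preserves the reconnection behavior described above.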
The docs are available here: https://docs.millicast.com/docs/multisource-streams
Once you have multiple sources being published as described above, the next step is to receive the sources within the viewer for playback.
By default, the Millicast servers will always negotiate a stream with an audio and video track corresponding to the main source (i.e. the one without a source id) in order to keep backward compatibility with the streams not using multisource.
You can specify the main source stream by using the pinnedSourceId attribute in the view command.
In order to be able to dynamically receive the other sources within a stream, Millicast has implemented a new feature called audio multiplexing. This feature allows the viewer to set up a number of audio tracks when starting the stream view and receive the currently active sources on them.
The Millicast Viewer node will track the voice activity of each of the incoming audio sources on behalf of the viewer, decide which are the most active ones, and forward their audio data in the tracks that the viewer has already opened. Each time the server changes the source being multiplexed into a track, a “vad” event is sent to the viewer, so the application knows exactly which source is being multiplexed in each track at any given time.
As many applications will both publish an audio source and view the stream, it is possible to send a list of excluded source ids within the view command; those sources will not be tracked or multiplexed back to that viewer.
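Combining the attributes above, the options passed when starting a view could look like this minimal sketch. The attribute names (`pinnedSourceId`, `multiplexedAudioTracks`, `excludedSourceIds`) come from the Millicast docs; the `events` list is an assumption based on the SDK's broadcast-event subscription, so check both against the current documentation:

```javascript
// Build the view options for an interactive viewer.
function viewOptions(mainSourceId, ownSourceId) {
  return {
    pinnedSourceId: mainSourceId,     // which source fills the main slot
    multiplexedAudioTracks: 3,        // extra audio tracks to open up front
    excludedSourceIds: [ownSourceId], // never multiplex my own audio back
    events: ['active', 'inactive', 'vad'], // broadcast events to receive
  };
}
```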
It is worth noting that by choosing to multiplex instead of audio mixing, Millicast is able to fully support End-to-End Encryption (E2EE) with interactive streaming since the audio never needs to be processed or re-encoded.
In order to improve the performance of the feature and to avoid incurring higher bandwidth costs, it is recommended to enable dtx (discontinuous transmission) on the publishing side, so audio data is only sent when a user’s voice is detected.
The docs are available here: https://docs.millicast.com/docs/audio-multiplexing
Implementing a Clubhouse-like application
Now, we are going to demonstrate how it is possible to implement an application similar to Clubhouse by combining the features described in the previous section.
The Millicast platform provides you with all of the media functionality required for the application, but does not provide you with a hosted “room” or the logic specific to this Clubhouse use case.
But to make things even easier for you, we have created a demo application and server that you can download from Github and host yourself:
- Demo server: https://github.com/millicast/millicast-spaces-server
- Demo app: https://github.com/millicast/millicast-spaces-app
We have also hosted an online version so you can easily try it yourself: https://spaces.millicast.com
Setup media publishing and viewing
The first step is to set up a wildcard publish token and a wildcard subscribe token for all of your streams and configure them on your server:
- Publish Stream Token: https://dash.millicast.com/docs.html?pg=millicast-api#publishing-stream-tokens
- Subscribe Stream Token: https://dash.millicast.com/docs.html?pg=millicast-api#subscribing-stream-tokens
The demo server is based on socket.io and will be in charge of handling all of the room and participant logic. The wildcard tokens will be used by the server to make a request to the Millicast Director API in order to get the final jwt token for publishing and viewing a source for each participant.
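A sketch of that server-side exchange of a wildcard token for a short-lived jwt follows; the Director API endpoint path and payload shape are based on the public Director API documentation, so verify both before relying on them:

```javascript
// Pure helper: build the request to the Millicast Director API.
function directorRequest(kind, wildcardToken, streamName) {
  return {
    url: `https://director.millicast.com/api/director/${kind}`, // 'publish' or 'subscribe'
    options: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${wildcardToken}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ streamName }),
    },
  };
}

// Exchange the wildcard token for a short-lived jwt bound to this stream.
async function getJwt(kind, wildcardToken, streamName) {
  const { url, options } = directorRequest(kind, wildcardToken, streamName);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Director API error: ${res.status}`);
  const { data } = await res.json();
  return data.jwt;
}
```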
When a user creates a new room, the socket.io server will assign it a unique identifier that will be used as the Millicast stream name (the final stream id is your account id plus the streamName). Socket.io also assigns a unique identifier to each user, which will be the sourceId when publishing media to the stream.
Once a room is created, a user can join it and become a participant. When a participant joins a room, the server will make a request to the Director API to get the jwt token for viewing the stream associated with the room (via the roomId).
In case the participant is also the owner of the room, the server will retrieve a jwt token in parallel to enable publishing as well:
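A minimal sketch of that join flow on the demo server, assuming a hypothetical `getJwt(kind, wildcardToken, streamName)` helper that wraps the Director API call:

```javascript
// Every participant gets a subscribe jwt for the room's stream; the owner
// additionally gets a publish jwt, requested in parallel.
async function onJoin(roomId, isOwner, tokens, getJwt) {
  const requests = [getJwt('subscribe', tokens.subscribe, roomId)];
  if (isOwner) {
    requests.push(getJwt('publish', tokens.publish, roomId));
  }
  const [subscribeJwt, publishJwt] = await Promise.all(requests);
  return { subscribeJwt, publishJwt: publishJwt ?? null };
}
```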
Note that the jwt tokens returned by the Director API have a very short life span and are bound to a streamId, so they can’t be used fraudulently by a rogue participant to publish to a different stream or spam the room.
The final use case in which the server needs to interact with the Millicast Director API is when a user is promoted by the host and becomes a speaker. The application and server handle the “raise hand” request and notify the owner of the room, who can then promote the participant to an active speaker. This triggers the server to make another request to the Millicast Director API to retrieve a jwt publish token, which is sent to the participant within the promoted event:
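A sketch of that flow on the socket.io server follows. The event names (`raiseHand`, `handRaised`, `promote`, `promoted`), the room bookkeeping, and the `getJwt` helper are all hypothetical; only the overall flow mirrors the description above:

```javascript
// Pure guard: only the room owner may promote; the payload carries the
// publish jwt plus the promoted user's sourceId.
function buildPromotion(room, requesterSocketId, targetUserId, publishJwt) {
  if (requesterSocketId !== room.ownerSocketId) return null;
  return {
    event: 'promoted',
    to: targetUserId,
    payload: { publishJwt, sourceId: targetUserId },
  };
}

// Wiring on the socket.io server (getJwt wraps the Director API call).
function wirePromotion(io, socket, rooms, getJwt) {
  // A participant asks to speak: notify the room owner.
  socket.on('raiseHand', ({ roomId }) => {
    const room = rooms.get(roomId);
    io.to(room.ownerSocketId).emit('handRaised', { userId: socket.id });
  });

  // The owner promotes: fetch a publish jwt and send it in 'promoted'.
  socket.on('promote', async ({ roomId, userId }) => {
    const room = rooms.get(roomId);
    const publishJwt = await getJwt('publish', room.publishToken, roomId);
    const promotion = buildPromotion(room, socket.id, userId, publishJwt);
    if (promotion) io.to(promotion.to).emit(promotion.event, promotion.payload);
  });
}
```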
Once your participant has received the jwt publishing token, it is time for the application to start publishing media to the stream.
You will need to configure a Publish object using the socket.io room id as the stream name and the socket.io client id as the sourceId, then call the connect method, making sure to enable dtx:
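A hedged sketch of this participant-side step, assuming the Millicast JS SDK; since the server already fetched the Director response (urls + jwt), the token generator can simply return it:

```javascript
// Connect options for a promoted participant.
function participantPublishOptions(mediaStream, clientId) {
  return {
    mediaStream,
    sourceId: clientId, // socket.io client id identifies this participant
    dtx: true,          // transmit audio only while the user is speaking
  };
}

async function startPublishing(roomId, directorResponse, mediaStream, clientId) {
  const { Publish } = await import('@millicast/sdk');
  // The socket.io room id doubles as the Millicast stream name.
  const publisher = new Publish(roomId, async () => directorResponse);
  await publisher.connect(participantPublishOptions(mediaStream, clientId));
  return publisher;
}
```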
You can view the full code on Github here:
Once you receive the viewer token, use the Viewer object of the SDK to start receiving the media for the stream.
We will set the pinnedSourceId to the room owner’s sourceId for all participants who are not the owner. This ensures the Millicast Viewer server will always send you the audio (and video) of the owner in the main stream.
Then we will configure three audio tracks for multiplexing (via the multiplexedAudioTracks parameter). By default, libwebrtc only mixes the top three audio levels detected within the received tracks, so this seems like a good trade-off to achieve the best performance.
Lastly, we set our own sourceId in the excludedSourceIds parameter so the server does not multiplex our own audio back to us.
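Putting those three settings together, here is a sketch of the viewer setup, assuming the Millicast JS SDK's View class and its 'track' event (verify both names against the SDK docs):

```javascript
// Viewer options for a non-owner participant.
function participantViewOptions(ownerSourceId, mySourceId) {
  return {
    pinnedSourceId: ownerSourceId,   // owner's audio/video in the main slot
    multiplexedAudioTracks: 3,       // matches libwebrtc's top-3 mixing
    excludedSourceIds: [mySourceId], // don't multiplex my own audio back
  };
}

async function startViewing(roomId, directorResponse, ownerSourceId, mySourceId, onTrack) {
  const { View } = await import('@millicast/sdk');
  const viewer = new View(roomId, async () => directorResponse);
  viewer.on('track', onTrack); // fires for main-stream and multiplexed tracks
  await viewer.connect(participantViewOptions(ownerSourceId, mySourceId));
  return viewer;
}
```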
Once the connection is established, you will receive one event for each remote track in the main stream and one for each multiplexed track. You can differentiate between them either by the number of audio tracks in the stream (all multiplexed audio tracks will be associated with the same stream) or by the mid order of the associated transceivers (the first ones of each kind belong to the main stream).
You can view the full code on Github here:
Detecting who is talking
Once you have everything set up, your application may want to show which of the participants is speaking and display a voice level indicator.
One event is emitted when a new source is published to or removed from the stream, and another “vad” event indicates which source id is being multiplexed into each audio track based on its voice activity level.
Those events allow you to immediately detect which participants are being received in each track, so you just need to use the WebRTC peer connection stats to retrieve the audio level for that track:
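As a sketch, the application can keep a map from each multiplexed track to the participant it currently carries, and read the audio level from standard WebRTC stats. The shape of the “vad” event payload (`mediaId`, `sourceId`) is an assumption based on the docs:

```javascript
const trackSource = new Map(); // transceiver mid -> sourceId currently carried

function onBroadcastEvent({ name, data }) {
  if (name === 'vad') {
    trackSource.set(data.mediaId, data.sourceId); // null sourceId => silence
  }
  // 'active' / 'inactive' events announce sources joining or leaving the
  // stream and can drive the participant list.
}

// Standard WebRTC: read audioLevel from the inbound-rtp stats of a receiver.
async function audioLevelFor(peerConnection, receiver) {
  const stats = await peerConnection.getStats(receiver.track);
  let level = 0;
  stats.forEach((report) => {
    if (report.type === 'inbound-rtp' && report.kind === 'audio') {
      level = report.audioLevel ?? 0;
    }
  });
  return level;
}
```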
Multisource and audio multiplexing are just the first two features that Millicast has made available to allow our customers to add interactivity to their applications.
But this is only the beginning.
We have also launched a projection feature that allows viewers to dynamically choose which of a stream’s sources to receive (with both audio and video!), and a secondary stream view feature that lets you watch two different streams simultaneously.
This enables Millicast to be used as a massively distributed SFU and implement any type of meeting-like application (build your own Zoom, Meet or Teams).
What are you waiting for?
To join this Millicast Interactive Beta, you need to:
- Sign up for a free Millicast account (if you haven’t already).
- Send an email to firstname.lastname@example.org and request access to the Millicast Interactive Beta program.
- Access the documentation at https://docs.millicast.com/docs/interactivity
We will enable the multisource capabilities on your Millicast account so you can start building your own real-time interactive streaming applications and take over the world.