~/posts/2025-08-15_streaming-with-webrtc-and-mediasoup.md
$

cat 2025-08-15_streaming-with-webrtc-and-mediasoup.md


Someone posted on Twitter about the exorbitant price that Zoom charges if one wants to organise a single session with more than 10,000 attendees.

It's around $6000.

That led me to wonder what goes into building something at this scale and why it's priced so high. That's how I came across Mediasoup as an SFU.

What is an SFU?

First of all, let's cover what an SFU is and what the alternatives to it are. SFU stands for Selective Forwarding Unit: a centralised media server which receives streams from multiple entities and forwards them to one or more receivers. The SFU controls how and what data is sent to each receiver.

For example, if you want to build a multi-producer, multi-consumer app (which we will build below), you can forward the producers' video and audio streams to only the relevant consumers. You could also degrade or enhance a participant's experience based on their plan.

Is an SFU the only option? No; the common alternatives are an MCU or a P2P mesh.

An MCU, or Multipoint Control Unit, receives multiple streams from multiple sources, merges them into a single stream and sends it to all destinations. This means the degree of control over what data is sent to each receiver is lower. In a P2P mesh, every participant sends its stream directly to every other participant; with n participants that is n(n-1)/2 connections, which quickly becomes impractical at scale.

Introduction to Mediasoup

Mediasoup is an open-source, server-side WebRTC (Web Real-Time Communication) library designed for building scalable real-time communication applications. It functions as an SFU. The server-side library is available for both NodeJS and Rust, and it pairs with the mediasoup-client library on the browser side.

In this blog, we will demonstrate a basic producer-consumer setup using Mediasoup in NodeJS.

As mentioned above, an SFU is a centralised entity which receives streams and forwards them to other clients. So our example needs one server acting as the SFU and at least one client. In our case, the client will act as a producer and will also consume its own stream.

We will use the SocketIO NodeJS module for the client and server to talk to each other. We will not go deep into how SocketIO works; it's essentially a library which internally uses WebSockets or HTTP long-polling to send and fetch data. We need a bi-directional channel here because the server also has to push information to the client.

Basic client-server connection

// On the server, to start the SocketIO server
import { Server, Socket } from "socket.io";

io = new Server(appServer, {
  cors: {
    origin: "http://localhost:3000", // Replace with your Next.js client's origin
    methods: ["GET", "POST"], // Allow necessary HTTP methods
    credentials: true, // Allow sending cookies, if needed
  },
});

// To receive the connection
io.on("connection", (socket: Socket) => {
  console.log("New client connected", socket.id);

  socket.on("disconnect", () => {
    console.log("Client disconnected", socket.id);
    removeNode(socket.id);
  });
});

// On the client
import { io, Socket } from "socket.io-client";

socket = io("http://127.0.0.1:8080");
socket.on("connect", async () => {
  console.log("Connected to signaling server");
});

// To send a message and await the server's acknowledgement
socket.emitWithAck("consumer-resume", { consumerId: consumer.id });
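
On the server side, an emitWithAck message is handled like a normal event, except the handler receives a callback whose invocation resolves the client's promise. A minimal sketch, assuming a consumers map keyed by consumer id (the map and handler shape are illustrative, not from the original code):

// On the server: acknowledging an emitWithAck message
// (`consumers` is a hypothetical map of consumerId -> mediasoup Consumer)
socket.on("consumer-resume", async ({ consumerId }, callback) => {
  await consumers[consumerId].resume(); // resume the paused consumer
  callback({ resumed: true }); // resolves the client's emitWithAck promise
});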

Using your webcam and audio

To enable both video and audio capture on your machine, set video: true and audio: true accordingly. Once the stream is enabled, you can assign it to localVideoRef, which is a reference to an HTMLVideoElement.

const localVideoRef = useRef<HTMLVideoElement>(null);

const getMedia = async () => {
  try {
    const localStream = await navigator.mediaDevices.getUserMedia({
      video: true,
      audio: true,
    });

    if (!localVideoRef.current) {
      console.error("Local video element is not available");
      return;
    }

    // Attach the stream to the video element for local playback
    localVideoRef.current.srcObject = localStream;
    // The video track is what we will later pass to sendTransport.produce
    const track = localStream.getVideoTracks()[0];

    console.log("Got MediaStream:", localStream);
  } catch (error) {
    console.error("Error accessing media devices.", error);
  }
};

The video element referenced by localVideoRef can then be rendered in the DOM like this:

<video
  className="mx-5"
  width="40%"
  ref={localVideoRef}
  autoPlay
  muted
></video>

Setting up the Mediasoup server

In this section, we cover the initialisation steps the server performs to be ready to receive streaming traffic. Before listening for any incoming traffic, the server sets up the Mediasoup entities, which start the required background processes and workers to handle these connections.

Since the Mediasoup server is a standalone NodeJS server, we need to start a Mediasoup Worker. The worker represents a C++ subprocess that handles the heavy lifting of media processing. It is the core component responsible for managing and manipulating audio and video streams.

import * as mediasoup from "mediasoup";
import { types } from "mediasoup";

createWorker = async () => {
  mediasoup.observer.on("newworker", (worker: types.Worker) => {
    console.log("new worker created [pid:%d]", worker.pid, worker.appData);
  });
  const worker = await mediasoup.createWorker({
    logLevel: "debug", // Set the general log level to debug
    //logTags: ["ice", "dtls"],
    appData: { foo: 123 },
    //dtlsCertificateFile: "./keys/cert.pem",
    //dtlsPrivateKeyFile: "./keys/key.pem",
  });
  return worker;
};
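
One thing the snippet above doesn't show is handling worker crashes. Mediasoup emits a died event on the worker when its subprocess exits unexpectedly, and the usual recommendation is to exit so a process manager can restart the server:

// Exit if the worker subprocess dies so the process manager can restart us
worker.on("died", (error) => {
  console.error("mediasoup worker died [pid:%d]", worker.pid, error);
  setTimeout(() => process.exit(1), 2000);
});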

The Mediasoup Router is a core component that acts as the SFU for real-time media streams. Its primary function is to manage and route audio and video RTP packets between different participants (producers and consumers) within a given media session, often analogous to a "multi-party conference room."

createRouter = async () => {
  this.worker.observer.on("newrouter", (router) => {
    console.log("new router created [id:%s]", router.id);
  });
  const router = await this.worker.createRouter({ mediaCodecs });
  return router;
};

In the above snippet, the createRouter method is being passed a mediaCodecs hash. We will cover that in a later section.

Sending the RTP capabilities

In a previous snippet, we saw that the router was being passed a mediaCodecs hash.

RTP capabilities define the media formats and features that a WebRTC endpoint, like a mediasoup router, can handle. They are essential for a server and client to negotiate and agree upon a common set of options for transmitting real-time audio and video. The capabilities describe what an endpoint is able to receive, while the RTP parameters specify what a producer endpoint is actually sending. The receiver's capabilities constrain the sender's parameters.

The process is:

  • The mediasoup router exposes its RTP capabilities via the router.rtpCapabilities property.
  • The client requests the capabilities, and the server sends the router's RTP capabilities to the client, which is running mediasoup-client.
  • The client-side mediasoup-client Device is loaded with the server's capabilities. It then combines its own browser capabilities with the router's capabilities to determine the final, negotiated capabilities for the session.

// On the client
import { Device } from "mediasoup-client";

const initConnectionWithServer = async (socket: Socket) => {
  routerRtpCapabilities = await socket.emitWithAck("getRouterRtpCapabilities");
  deviceRef.current = new Device();
  await deviceRef.current.load({ routerRtpCapabilities });
};

The Device is the central client-side object that represents a user's local endpoint connecting to a mediasoup Router. It acts as the bridge between your client application and the mediasoup server, handling the browser-specific WebRTC details for you.

The media codecs defined on the server:

const mediaCodecs: types.RouterOptions["mediaCodecs"] = [
  {
    kind: "video",
    mimeType: "video/vp8",
    preferredPayloadType: 100, // Example payload type
    clockRate: 90000,
    parameters: {},
    rtcpFeedback: [
      { type: "nack" },
      { type: "nack", parameter: "pli" },
      { type: "ccm", parameter: "fir" },
      { type: "goog-remb" },
    ],
  },
  {
    kind: "video",
    mimeType: "video/H264",
    clockRate: 90000,
    parameters: {
      "packetization-mode": 1,
      "profile-level-id": "42e01f",
      "level-asymmetry-allowed": 1,
    },
  },
];

In the above example, we have two video codec entries. During negotiation, a codec that is present in both the client's and the server's capabilities is selected.
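
On the client, a quick sanity check that negotiation actually produced a usable video codec is the Device's canProduce method (a small addition, not in the original snippets):

// After device.load(), verify the negotiated capabilities allow sending video
if (!deviceRef.current.canProduce("video")) {
  console.warn("Cannot produce video: no common video codec with the router");
}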

Setting up the sendTransport and recvTransport

We are now getting to the fun stuff. So far, we have set up the server-side entities, the worker and the router. On the client side, we have set up the mediasoup Device, which handles the browser-specific WebRTC details. These entities handle the connections that we will be making from here on.

Transports between a Mediasoup client and server are unidirectional. A transport created for sending can only carry media from the client to the server, so if the client also wants to receive a stream from the server, another transport has to be opened for the receive direction.

Another point of note is that each direction has its own creation method: to send a stream, the client calls createSendTransport on its Device, and to receive one, it calls createRecvTransport.

One point of differentiation from other networking libraries I have worked with is that Mediasoup requires both the server and the client to maintain their own transport object for each direction: every client-side send or receive transport is paired with a WebRtcTransport on the server.

The code snippet below, on the client, emits a createWebRtcTransport call to the server. The server creates a WebRtcTransport and passes the transport.id back to the client, along with the ICE and DTLS parameters.

Using this information, the client's Device object creates a corresponding transport entity via the createSendTransport method.

Once the transport exists on both the client and the server, the client calls sendTransport.produce with the track information. Calling produce on the client's sendTransport emits the connect and produce events, in response to which the server completes the transport connection and creates the producer.

// In the client

const createSendTransport = (socket: Socket) => {
    socket.emit(
      "createWebRtcTransport",
      { sender: true },
      ({ params }: { params: any }) => {
        if (params.error) {
          console.log(params.error);
          return;
        }
        if (deviceRef.current == null) {
          console.error("Device is not initialized yet in createSendTransport");
          return;
        }
        sendTransport = deviceRef.current.createSendTransport(params);

        sendTransport.on(
          "connect",
          async ({ dtlsParameters }, callback, errback) => {
            try {
              await socket.emit("transport-connect", {
                dtlsParameters,
                transportId: sendTransport.id,
              });

              // Tell the transport that parameters were transmitted.
              callback();
            } catch (error: any) {
              errback(error);
            }
          }
        );

        sendTransport.on("produce", async (parameters, callback, errback) => {
          try {
            await socket.emit(
              "transport-produce",
              {
                kind: parameters.kind,
                rtpParameters: parameters.rtpParameters,
                appData: parameters.appData,
                transportId: sendTransport.id,
              },
              ({ id }: { id: any }) => {
                callback({ id });
                // Uncomment if you want to create a client receiver
                //createClientReceiver(socket, id);
              }
            );
          } catch (error: any) {
            errback(error);
          }
        });
        connectSendTransport();
      }
    );
};

// "params" is assumed to hold the produce options from the captured media,
// e.g. { track } with the video track obtained in getMedia
const connectSendTransport = async () => {
  producer = await sendTransport.produce(params);
  console.log("Producer created:", producer.id, producer.kind);

  producer.on("trackended", () => {
    console.log("track ended");
  });

  producer.on("transportclose", () => {
    console.log("transport ended");
  });
};

// In the server
socket.on("createWebRtcTransport", async (data, callback) => {
  console.log("Received createWebRtcTransport", data, callback);
  const transport: mediasoupTypes.WebRtcTransport = await rtc.createWebRtcTransport();
  callback({
    params: {
      id: transport.id,
      iceParameters: transport.iceParameters,
      iceCandidates: transport.iceCandidates,
      dtlsParameters: transport.dtlsParameters,
    },
  });
});

socket.on(
  "transport-connect",
  async ({ dtlsParameters, transportId }) => {
    console.log("Received transport-connect");
    await sendTransports[transportId].connect({ dtlsParameters });
  }
);

socket.on(
  "transport-produce",
  async ({ kind, rtpParameters, appData, transportId }, callback) => {
    console.log("Received transport-produce");
    const producer: mediasoupTypes.Producer = await sendTransports[
      transportId
    ].produce({
      kind,
      rtpParameters,
    });

    console.log("Producer ID: ", producer.id, producer.kind);

    producer.on("transportclose", () => {
      console.log("transport for this producer closed ");
      producer.close();
    });

    // Send back to the client the Producer's id
    callback({
      id: producer.id,
    });
    registerNewProducer(producer);
  }
);
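
The rtc.createWebRtcTransport helper used above is not shown. A minimal sketch of what it might look like, assuming a this.router reference and a module-level sendTransports map (both are assumptions; the actual helper is in the linked repo):

createWebRtcTransport = async () => {
  const transport = await this.router.createWebRtcTransport({
    // Local development only; use your public IP / announcedIp in production
    listenIps: [{ ip: "127.0.0.1" }],
    enableUdp: true,
    enableTcp: true,
    preferUdp: true,
  });
  // Store the transport so the transport-connect / transport-produce
  // handlers can look it up by id
  sendTransports[transport.id] = transport;
  return transport;
};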

This sets up the stream connection from the client to the server. As mentioned previously, we also want the client to receive the video stream back from the server.

This requires the client and the server to each create a receiver transport. On the client, this means calling createRecvTransport on the Device; on the server, it means creating another WebRtcTransport on the router.

One major difference between the send and receive flows is that the receiver also needs the producer's information to start receiving traffic. This is the same producer information that the server captured in the first flow, when the client was sending its stream to the server.

The rest of the receive flow mirrors the send flow; a condensed sketch of the consume step follows below.
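
To make that concrete, here is a minimal sketch of the consume step (the transport-recv-connect handshake is omitted for brevity, and the event names, recvTransport references and remoteVideoRef are illustrative assumptions; the full version is in the linked repo):

// On the client: ask the server to create a consumer for a known producer
const consumerParams = await socket.emitWithAck("consume", {
  producerId,
  rtpCapabilities: deviceRef.current.rtpCapabilities,
});
const consumer = await recvTransport.consume(consumerParams);

// The consumer starts paused on the server; attach the track, then resume
remoteVideoRef.current.srcObject = new MediaStream([consumer.track]);
await socket.emitWithAck("consumer-resume", { consumerId: consumer.id });

// On the server: create a consumer on the receive transport
socket.on("consume", async ({ producerId, rtpCapabilities }, callback) => {
  // Verify that the client's capabilities can actually consume this producer
  if (!router.canConsume({ producerId, rtpCapabilities })) return;
  const consumer = await recvTransport.consume({
    producerId,
    rtpCapabilities,
    paused: true, // start paused and resume once the client is ready
  });
  callback({
    id: consumer.id,
    producerId,
    kind: consumer.kind,
    rtpParameters: consumer.rtpParameters,
  });
});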

Dump of the entire flow for easier visualisation

// Initialisation
client -> connect websocket -> server
client -> getRouterRtpCapabilities -> server
client -> createDevice
client -> initiate mediastream and reference the stream in the video tag

// Sending data from client to server
client -> createWebRtcTransport -> server
server -> createSendTransport
client -> createSendTransport
client -> sendTransport.produce
client -> transport-connect -> server
client -> transport-produce -> server
server -> sendTransport.connect
server -> sendTransport.produce

// Receiving data from server to client
client -> createWebRtcTransport -> server
server -> createRecvTransport
client -> createRecvTransport
client -> consume -> server
server -> recvTransport.consume
client -> recvTransport.consume
client -> transport-recv-connect -> server
server -> recvTransport.connect
client -> consumer-resume -> server
server -> consumer.resume

Conclusion

Mediasoup is a pretty nifty module to set up as an SFU. The code snippets in this post can be found in the repository linked in the references below.

Another interesting point that this post doesn't cover is how to debug when your WebRTC streams don't work as expected. I plan to write another post on the best way to verify that a WebRTC stream has been set up correctly. For example, I faced multiple issues while setting up the DTLS parameters because of a server misconfiguration; identifying the actual problem is the hard part, and WebRTC has great tooling support for it.

References

  • https://github.com/gsarmaonline/mediasoup-basic

Hope you liked reading the article.

Please reach out to me here for more ideas or improvements.