WebRTC: Dynamic Webcam using AI
Nilesh Gawande
Co-founder and VP - Innovations at SpringCT. Creator of ProCONF, creator of ARIA. Expertise in architecting Video Conferencing (WebRTC), Digital Human, CoBrowsing, WebXR, Healthcare, and IoT systems.
"Dynamic webcam" refers to a webcam with advanced features that allow it to adapt its settings in real-time based on certain conditions. In this blog we will cover one aspect of Dynamic Webcam that will allow us to track user in the camera frame and adjust his video stream such that he will remain in center position irrespective of his position in front of camera.
See the image below. If the user is sitting a little away from the camera, other users in the conference will see them smaller in the tile (remote view).
This can be improved by utilizing a dynamic webcam. Take a look at the adjusted remote view below. Doesn't it present the user nicely centered within the frame?
Human Face Detection in Live Stream:
The initial step in addressing this challenge involves detecting individuals within the webcam stream. We will employ the MediaPipe face detection model, via TensorFlow.js, to detect the user. To incorporate MediaPipe into your project, please follow the instructions provided at this link:
https://github.com/tensorflow/tfjs-models/tree/master/face-detection/src/mediapipe
Include the following scripts in index.html:
<!-- Require the peer dependencies of face-detection. -->
<script src="https://cdn.jsdelivr.net/npm/@mediapipe/face_detection"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-core"></script>
<!-- You must explicitly require a TF.js backend if you're not using the TF.js union bundle. -->
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs-backend-webgl"></script>
<script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/face-detection"></script>
and initialize faceDetector as follows:
// Initialize the face detector
const initializefaceDetector = async () => {
  const model = faceDetection.SupportedModels.MediaPipeFaceDetector;
  const detectorConfig = {
    runtime: 'mediapipe',
    modelType: 'full',
    maxFaces: 6,
    solutionPath: 'https://cdn.jsdelivr.net/npm/@mediapipe/face_detection',
  };
  faceDetector = await faceDetection.createDetector(model, detectorConfig);
};
Once the model is loaded, we are ready to process the local video stream for the presence of a human face. This can be done as follows:
detections = await faceDetector.estimateFaces(video, {flipHorizontal: false});
The video parameter passed here is the HTML video element which holds the local video stream obtained from navigator.mediaDevices.getUserMedia(). detections will hold the coordinates of the detected face(s), which can then be used to crop the user from the source stream and place them at the center of the destination stream.
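Putting these pieces together, a minimal sketch of wiring the webcam to the detector could look like this (the startDetection() name and the dynamically created video element are illustrative assumptions, not from the original project):
// Minimal sketch: capture the webcam and run one detection pass.
// Assumes initializefaceDetector() from above assigns faceDetector.
const video = document.createElement('video');

async function startDetection() {
  await initializefaceDetector();

  // Capture the local webcam stream and feed it to a video element.
  const localVideoStream = await navigator.mediaDevices.getUserMedia({ video: true });
  video.srcObject = localVideoStream;
  await video.play();

  // Each detection carries a bounding box in video pixel coordinates.
  const detections = await faceDetector.estimateFaces(video, { flipHorizontal: false });
  for (const det of detections) {
    const { xMin, yMin, width, height } = det.box;
    console.log('face at', xMin, yMin, 'size', width, 'x', height);
  }
}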
Conference on Janus Media Server:
To prove the concept of the dynamic webcam, we must process the local stream and transmit the corrected stream in the video conference, so that other users in the conference see the corrected view. For this reason, we are going to use the Janus media server. We need at least two users to join the conference and share their video streams. Both users start the video conference normally by joining the room, and each sees the other's video (local view and remote view). The first user can then enable the dynamic webcam by clicking a checkbox; the effect is observed by the second user.
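For context, joining the room with the Janus JavaScript API (janus.js) looks roughly like the sketch below; the actual flow in janus-service.js may differ, and the callback wiring shown here is a simplified assumption:
// Simplified sketch of joining a Janus video room as a publisher,
// using the janus.js client library. Error handling omitted.
let videoroom; // plugin handle, used later to publish and replace tracks

Janus.init({
  debug: false,
  callback: () => {
    const janus = new Janus({
      server: janusURL, // e.g. 'wss://janus.conf.meetecho.com/ws'
      success: () => {
        janus.attach({
          plugin: 'janus.plugin.videoroom',
          success: (pluginHandle) => {
            videoroom = pluginHandle;
            // Join the default room (1234) as a publisher.
            videoroom.send({
              message: { request: 'join', room: 1234, ptype: 'publisher', display: 'user1' },
            });
          },
          onmessage: (msg, jsep) => {
            // On the 'joined' event, create an offer and publish the local
            // stream; remote publishers are subscribed to separately.
          },
        });
      },
    });
  },
});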
To see the effect of the dynamic webcam, we have written a function that toggles the user's feed in the video conference between the original video and the processed (croppedStream) video. See the toggleDynamicWebcam() function below:
function toggleDynamicWebcam() {
  // listen to change of isDynamic button state
  let isDynamic = document.getElementById('isdynamic');
  console.log('IsDynamic state:', isDynamic?.checked);

  if (isDynamic.checked) {
    video.srcObject = localVideoStream;
    video.play();
    video.onloadedmetadata = () => {
      predictWebcam();
      const croppedStream = backGroundCanvas.captureStream();
      document.getElementById('local_video').srcObject = croppedStream;
      replaceVideoTrack(croppedStream.getVideoTracks()[0]);
    };
  } else {
    replaceVideoTrack(localVideoStream.getVideoTracks()[0]);
    document.getElementById('local_video').srcObject = localVideoStream;
    cancelAnimationFrame(animationTimer);
    cancelAnimationFrame(secDrawAnimTimer);
    video.srcObject = null;
    dummyVideos.srcObject = null;
  }
}
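The replaceVideoTrack() helper used above is not shown in the snippet; a minimal sketch using the standard RTCRtpSender.replaceTrack() API could look like the following (assuming peerConnection is the publisher's RTCPeerConnection created by Janus):
// Swap the outgoing video track without renegotiating the session.
function replaceVideoTrack(newTrack) {
  const sender = peerConnection
    .getSenders()
    .find((s) => s.track && s.track.kind === 'video');
  if (sender) {
    sender.replaceTrack(newTrack); // returns a promise; no renegotiation needed
  }
}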
The predictWebcam() function takes care of using the faceDetector model. Here the user is identified in the frame, the relevant portion of the source frame is cropped, and it is copied onto a background canvas by the drawCroppedFrame() function.
// Prediction loop
async function predictWebcam() {
  // Now let's start classifying the stream.
  let detections = [];
  if (!isUpdating) {
    try {
      detections = await faceDetector.estimateFaces(video, {flipHorizontal: false});
      ENABLE_LOG && console.log('faces:', detections);
    } catch (error) {
      console.error('error in estimate faces:', error);
    }
    for (let n = 0; n < detections.length; n++) {
      if (detections[n].box.xMin <= boundingBoxLeftMost.x) {
        setBox(boundingBoxLeftMost, detections[n].box);
      }
      if ((detections[n].box.xMin + detections[n].box.width)
        > (boundingBoxRightMost.x + boundingBoxRightMost.width)) {
        setBox(boundingBoxRightMost, detections[n].box);
      }
      if (detections[n].box.yMin <= boundingBoxTopMost.y) {
        setBox(boundingBoxTopMost, detections[n].box);
      }
      if ((detections[n].box.yMin + detections[n].box.height)
        > (boundingBoxBelowMost.y + boundingBoxBelowMost.height)) {
        setBox(boundingBoxBelowMost, detections[n].box);
      }
    }
    targetBbox = [
      boundingBoxLeftMost.x,
      boundingBoxTopMost.y,
      boundingBoxRightMost.x - boundingBoxLeftMost.x + boundingBoxRightMost.width,
      boundingBoxBelowMost.y - boundingBoxTopMost.y + boundingBoxBelowMost.height,
    ];
    resetBboxes();
    ENABLE_LOG && console.log('targetBbox:', targetBbox, detections.length);
    if (detections.length > 0 && !isUpdating) {
      updateCroppingBoxDimension(targetBbox);
    }
  }
  drawCroppedFrame();
  animationTimer = window.requestAnimationFrame(predictWebcam);
}
Please note, we have used window.requestAnimationFrame(predictWebcam) to call the predictWebcam function recursively in a loop.
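The setBox() and resetBboxes() helpers used in the loop are straightforward; one possible implementation, matching how predictWebcam() uses them, is sketched below (the initial values are assumptions chosen so the first detection always wins each comparison):
// Copy a detection box into one of the tracked extreme boxes.
function setBox(target, box) {
  target.x = box.xMin;
  target.y = box.yMin;
  target.width = box.width;
  target.height = box.height;
}

// Reset the extremes so the next frame recomputes them from scratch.
function resetBboxes() {
  boundingBoxLeftMost  = { x: Infinity, y: 0, width: 0, height: 0 };
  boundingBoxTopMost   = { x: 0, y: Infinity, width: 0, height: 0 };
  boundingBoxRightMost = { x: 0, y: 0, width: 0, height: 0 };
  boundingBoxBelowMost = { x: 0, y: 0, width: 0, height: 0 };
}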
Generate a new stream using drawCroppedFrame():
function drawCroppedFrame() {
  const context = backGroundCanvas.getContext('2d');
  const x = boundingBox.x - (boundingBox.width / 2);
  const y = boundingBox.y - (boundingBox.height / 1.5);
  let videoWidth = 2 * (boundingBox.x - x) + boundingBox.width;
  let videoHeight = 3 * (boundingBox.y - y) + boundingBox.height;
  videoWidth = x + videoWidth >= video.videoWidth ? video.videoWidth - x : videoWidth;
  videoHeight = y + videoHeight >= video.videoHeight ? video.videoHeight - y : videoHeight;
  const hRatio = backGroundCanvas.width / videoWidth;
  const vRatio = backGroundCanvas.height / videoHeight;
  const ratio = Math.min(hRatio, vRatio);
  const centerShiftX = (backGroundCanvas.width - videoWidth * ratio) / 2;
  const centerShiftY = (backGroundCanvas.height - videoHeight * ratio) / 2;
  context.clearRect(0, 0, backGroundCanvas.width, backGroundCanvas.height);
  context.fillStyle = 'grey';
  context.fillRect(0, 0, backGroundCanvas.width, backGroundCanvas.height);
  context.drawImage(video, parseInt(x, 10), parseInt(y, 10),
    parseInt(videoWidth, 10), parseInt(videoHeight, 10),
    parseInt(centerShiftX, 10), parseInt(centerShiftY, 10),
    parseInt(videoWidth * ratio, 10), parseInt(videoHeight * ratio, 10)
  );
}
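The boundingBox that drawCroppedFrame() reads is maintained by updateCroppingBoxDimension(), which is not shown above. A plausible sketch is to ease the current crop box toward the newly detected target over several frames (driven by secDrawAnimTimer) so the virtual camera pans smoothly instead of jumping; the easing factor and convergence test below are illustrative assumptions, not the original implementation:
// Ease boundingBox toward targetBbox = [x, y, width, height].
// isUpdating pauses detection in predictWebcam() during the transition.
function updateCroppingBoxDimension(targetBbox) {
  isUpdating = true;
  const [tx, ty, tw, th] = targetBbox;
  const k = 0.1; // easing factor per frame (assumption)

  const step = () => {
    boundingBox.x += (tx - boundingBox.x) * k;
    boundingBox.y += (ty - boundingBox.y) * k;
    boundingBox.width += (tw - boundingBox.width) * k;
    boundingBox.height += (th - boundingBox.height) * k;

    if (Math.abs(tx - boundingBox.x) < 1 && Math.abs(ty - boundingBox.y) < 1) {
      isUpdating = false; // close enough: resume detection
    } else {
      secDrawAnimTimer = window.requestAnimationFrame(step);
    }
  };
  secDrawAnimTimer = window.requestAnimationFrame(step);
}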
Source Code:
Please refer to this Git link for the entire source code:
Please note: before you run this code, ensure you set janusURL to your Janus installation. The current janusURL is that of the public Janus deployment, and you will have to use room 1234, which is the default room.
const janusURL = 'wss://janus.conf.meetecho.com/ws'
Also note: this is not production-quality code. janus-service.js has been modified to support only two users.
A Little History:
I came across a great product offered by Poly at the UCX London event. It provides a similar feature, but using a hardware-based camera, the Studio X70 (https://www.poly.com/in/en/products/video-conferencing/studio/studio-x70). This camera offers various features including dynamic zoom in/out. This inspired me to find a similar software-based solution for video conferencing systems, and hence we created the Dynamic Webcam project.
At SpringCT, we develop high-quality video conferencing solutions. Our in-depth knowledge of WebRTC and media servers has helped us build great conferencing products for our customers.
We have a team of experienced and skilled developers who specialize in WebRTC, utilizing the latest tools, frameworks, and best practices to create robust and reliable applications. Our developers are well-versed in the intricacies of WebRTC protocols, enabling them to optimize performance, minimize latency, and ensure seamless connectivity across different devices and browsers. For more details visit us at https://www.springct.com/collaboration/
Author: Nilesh Gawande
CoAuthor: Ayan Karmakar