Virtual Photobooth

Real-Time Multi-User Photobooth with GAN-Based Style Transfer

Role

Full-Stack Developer

Stack

PyTorch · WebRTC · MediaPipe · Flask · JavaScript

Timeline

Spring 2025 Academic Project

Project Overview

A real-time virtual photo booth that enables two remotely connected users to appear in the same artistic-styled frame using peer-to-peer video streaming and GAN-based style transfer. Built as an end-to-end web application combining computer vision, generative AI, and real-time video processing.

The Challenge

Create an interactive multi-user photo booth that goes beyond single-user filters (Instagram, Snapchat) by enabling real-time artistic style transfer for two people in a shared virtual space.

Technical Challenges:

  • Real-time person segmentation and compositing across different devices
  • Managing peer-to-peer video streaming with low latency
  • Implementing GAN-based style transfer fast enough for interactive use
  • Cross-browser compatibility for WebRTC and canvas operations

Technical Stack

Machine Learning: PyTorch, Pix2Pix GAN, U-Net architecture, PatchGAN discriminator

Computer Vision: MediaPipe, HTML5 Canvas, real-time segmentation

Web Technologies: JavaScript, WebRTC, PeerJS, RESTful APIs

Backend: Flask, base64 image processing, GPU acceleration

Development: Cross-browser testing, performance optimization

Technical Implementation

System Architecture

Frontend: Browser-based interface (HTML5, CSS3, JavaScript)

Video Streaming: WebRTC peer-to-peer connections via PeerJS library

Backend: Flask server hosting trained Pix2Pix model with RESTful API

Computer Vision: MediaPipe Selfie Segmentation for real-time person extraction

Machine Learning Pipeline

Pix2Pix GAN Implementation:

  • Generator: U-Net architecture with 7 downsampling layers and skip connections
  • Discriminator: PatchGAN architecture for texture-focused adversarial training
  • Training: 150 epochs constant learning rate + 50 epochs linear decay
  • Loss Function: Combined adversarial loss + L1 loss (λ=100) for structure preservation
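
The combined objective above matches the standard Pix2Pix formulation from Isola et al., with the L1 weight λ = 100 noted above:

```latex
G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G,D) + \lambda\,\mathcal{L}_{L1}(G),
\qquad \lambda = 100
```

where the L1 term penalizes pixel-wise deviation from the target image, preserving structure while the adversarial term sharpens texture:

```latex
\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\!\left[\lVert y - G(x,z)\rVert_{1}\right]
```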

Model Performance:

  • Resolution: 128×128 processing (proof of concept)
  • Training Time: 15 hours on available hardware (~1000 image pairs)
  • Inference Speed: 70-90ms per frame with GPU acceleration
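
The two-phase learning-rate schedule described above (150 epochs at a constant rate, then 50 epochs of linear decay) can be sketched as a plain multiplier function. The epoch counts come from this project's training setup; everything else here is illustrative:

```python
def lr_multiplier(epoch, constant_epochs=150, decay_epochs=50):
    """Return the factor applied to the base learning rate at a given epoch.

    Epochs 0..constant_epochs-1 train at the full rate; the remaining
    decay_epochs decay linearly toward zero (the standard Pix2Pix schedule).
    """
    if epoch < constant_epochs:
        return 1.0
    # Linear decay over the final decay_epochs epochs.
    progress = epoch - constant_epochs + 1
    return max(0.0, 1.0 - progress / decay_epochs)

print(lr_multiplier(0))    # → 1.0 (constant phase)
print(lr_multiplier(149))  # → 1.0 (last constant epoch)
print(lr_multiplier(199))  # → 0.0 (end of decay)
```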

Real-Time Video Processing

Person Segmentation:

  • MediaPipe Selfie Segmentation for binary mask generation
  • HTML5 Canvas compositing operations (destination-out, source-over)
  • Robust performance across lighting conditions and subject variations

Video Compositing Pipeline:

User A Video → Person Segmentation → Silhouette Extraction → Composite with User B → Style Transfer → Real-time Display
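
The compositing step in the pipeline above can be illustrated with plain arithmetic: each output pixel takes User A's foreground where the segmentation mask is on, and User B's frame elsewhere. This is a simplified single-channel sketch in Python; the actual project performs the equivalent operation with HTML5 Canvas compositing on full-color frames:

```python
def composite(fg, bg, mask):
    """Alpha-composite a segmented foreground over a background frame.

    fg, bg: 2-D grids of pixel intensities (same shape).
    mask:   2-D grid of soft alpha values in [0, 1] from the segmenter,
            where 1.0 means "person" and 0.0 means "background".
    """
    return [
        [m * f + (1.0 - m) * b for f, b, m in zip(fg_row, bg_row, m_row)]
        for fg_row, bg_row, m_row in zip(fg, bg, mask)
    ]

# Toy 2x2 frames: User A's silhouette occupies the left column.
user_a = [[200, 200], [200, 200]]
user_b = [[50, 50], [50, 50]]
mask   = [[1.0, 0.0], [1.0, 0.0]]
print(composite(user_a, user_b, mask))  # → [[200.0, 50.0], [200.0, 50.0]]
```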

Performance Optimization:

  • Client-side throttling to 10 FPS for style transfer requests
  • Frame rate balancing between visual fluidity and server load
  • Efficient base64 encoding for API communication
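
A minimal sketch of the client-side throttle described above, written here in Python for illustration (the project implements this in browser JavaScript): frames arriving faster than the target interval are dropped rather than queued, so the server only ever sees the freshest frame.

```python
import time

class FrameThrottle:
    """Let at most `fps` frames per second through; drop the rest."""

    def __init__(self, fps=10):
        self.min_interval = 1.0 / fps
        self.last_sent = float("-inf")

    def should_send(self, now=None):
        """Return True if enough time has elapsed to send another frame."""
        now = time.monotonic() if now is None else now
        if now - self.last_sent >= self.min_interval:
            self.last_sent = now
            return True
        return False  # drop this frame; a fresher one arrives soon

throttle = FrameThrottle(fps=10)
# t=0.00s: first frame goes out; t=0.05s: too soon; t=0.10s: passes.
print([throttle.should_send(t) for t in (0.00, 0.05, 0.10)])  # → [True, False, True]
```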

Demo

Technical Achievement

Real-Time GAN Inference

Challenge: Standard GAN models are too slow for interactive applications

Solution: Optimized Pix2Pix architecture with reduced resolution and efficient server deployment

Result: 70-90ms processing time with a sustainable 10-15 FPS frame rate for real-time interaction

Multi-User Video Coordination

Challenge: Synchronizing video streams across different devices and varying network conditions

Solution: WebRTC peer-to-peer connections with PeerJS abstraction layer

Result: Stable connections with 300-500ms end-to-end latency and minimal server infrastructure

Browser-Based Computer Vision

Challenge: Running complex segmentation models efficiently in web browsers

Solution: MediaPipe's optimized web assembly implementation for client-side processing

Result: Robust person segmentation across lighting conditions while eliminating server load

Cross-Browser Compatibility

Challenge: WebRTC and canvas inconsistencies across different browsers

Solution: Browser detection with fallback handling and optimized performance for Chrome/Firefox

Result: Consistent user experience with graceful degradation for unsupported features

Performance Optimization

Challenge: Balancing visual quality with real-time performance constraints

Solution: Strategic resolution choices (128×128 processing, 640×480 display) with client-side throttling

Result: Interactive performance achieved; identified 320×240 as optimal production resolution

Results

Future Enhancements

Technical Improvements:

  • Higher Resolution Models: Scale to 320×240 for improved visual quality
  • Depth-Aware Compositing: Implement depth estimation for realistic spatial relationships
  • Lighting Normalization: Match lighting conditions between video streams
  • Temporal Consistency: Reduce frame-to-frame flicker in style transfer

Project Impact

This project demonstrates the practical integration of advanced ML techniques with modern web technologies, showcasing skills in:

  • Generative AI Implementation: From research papers to working applications
  • Real-Time Systems: Managing latency and performance constraints
  • Cross-Platform Development: Browser-based ML deployment

The successful combination of GANs, computer vision, and web technologies creates an interactive experience that pushes the boundaries of what's possible in browser-based machine learning applications.