Virtual Photobooth

Real-Time Multi-User Photobooth with GAN-Based Style Transfer

Role

Full-Stack Developer

Stack

PyTorch · WebRTC · MediaPipe · Flask · JavaScript

Timeline

Spring 2025 Academic Project

Project Overview

A real-time virtual photo booth that enables two remotely connected users to appear in the same artistic-styled frame using peer-to-peer video streaming and GAN-based style transfer. Built as an end-to-end web application combining computer vision, generative AI, and real-time video processing.

The Challenge

Create an interactive multi-user photo booth that goes beyond single-user filters (Instagram, Snapchat) by enabling real-time artistic style transfer for two people in a shared virtual space.

Technical Challenges:

  • Real-time person segmentation and compositing across different devices
  • Managing peer-to-peer video streaming with low latency
  • Implementing GAN-based style transfer fast enough for interactive use
  • Cross-browser compatibility for WebRTC and canvas operations

Technical Stack

Machine Learning: PyTorch, Pix2Pix GAN, U-Net architecture, PatchGAN discriminator

Computer Vision: MediaPipe, HTML5 Canvas, real-time segmentation

Web Technologies: JavaScript, WebRTC, PeerJS, RESTful APIs

Backend: Flask, base64 image processing, GPU acceleration

Development: Cross-browser testing, performance optimization

Technical Implementation

System Architecture

Frontend: Browser-based interface (HTML5, CSS3, JavaScript)

Video Streaming: WebRTC peer-to-peer connections via PeerJS library

Backend: Flask server hosting trained Pix2Pix model with RESTful API

Computer Vision: MediaPipe Selfie Segmentation for real-time person extraction

Machine Learning Pipeline

Pix2Pix GAN Implementation:

  • Generator: U-Net architecture with 7 downsampling layers and skip connections
  • Discriminator: PatchGAN architecture for texture-focused adversarial training
  • Training: 150 epochs constant learning rate + 50 epochs linear decay
  • Loss Function: Combined adversarial loss + L1 loss (λ=100) for structure preservation
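
The combined objective above matches the standard Pix2Pix formulation from Isola et al., with the L1 weight λ = 100 noted above:

```latex
G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G,D) + \lambda\,\mathcal{L}_{L1}(G),
\qquad \lambda = 100
```

where the L1 term penalizes pixel-wise deviation from the target image, preserving structure while the adversarial term sharpens texture:

```latex
\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\!\left[\lVert y - G(x,z)\rVert_{1}\right]
```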

Model Performance:

  • Resolution: 128×128 processing (proof of concept)
  • Training Time: 15 hours on available hardware (~1000 image pairs)
  • Inference Speed: 70-90ms per frame with GPU acceleration
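
The two-phase learning-rate schedule described above (150 epochs at a constant rate, then 50 epochs of linear decay) can be sketched as a plain multiplier function. The epoch counts come from this project's training setup; everything else here is illustrative:

```python
def lr_multiplier(epoch, constant_epochs=150, decay_epochs=50):
    """Return the factor applied to the base learning rate at a given epoch.

    Epochs 0..constant_epochs-1 train at the full rate; the remaining
    decay_epochs decay linearly toward zero (the standard Pix2Pix schedule).
    """
    if epoch < constant_epochs:
        return 1.0
    # Linear decay over the final decay_epochs epochs.
    progress = epoch - constant_epochs + 1
    return max(0.0, 1.0 - progress / decay_epochs)

print(lr_multiplier(0))    # → 1.0 (constant phase)
print(lr_multiplier(149))  # → 1.0 (last constant epoch)
print(lr_multiplier(199))  # → 0.0 (end of decay)
```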

Real-Time Video Processing

Person Segmentation:

  • MediaPipe Selfie Segmentation for binary mask generation
  • HTML5 Canvas compositing operations (destination-out, source-over)
  • Robust performance across lighting conditions and subject variations

Video Compositing Pipeline:

User A Video → Person Segmentation → Silhouette Extraction → Composite with User B → Style Transfer → Real-time Display
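
The compositing step in the pipeline above can be illustrated with plain arithmetic: each output pixel takes User A's foreground where the segmentation mask is on, and User B's frame elsewhere. This is a simplified single-channel sketch in Python; the actual project performs the equivalent operation with HTML5 Canvas compositing on full-color frames:

```python
def composite(fg, bg, mask):
    """Alpha-composite a segmented foreground over a background frame.

    fg, bg: 2-D grids of pixel intensities (same shape).
    mask:   2-D grid of soft alpha values in [0, 1] from the segmenter,
            where 1.0 means "person" and 0.0 means "background".
    """
    return [
        [m * f + (1.0 - m) * b for f, b, m in zip(fg_row, bg_row, m_row)]
        for fg_row, bg_row, m_row in zip(fg, bg, mask)
    ]

# Toy 2x2 frames: User A's silhouette occupies the left column.
user_a = [[200, 200], [200, 200]]
user_b = [[50, 50], [50, 50]]
mask   = [[1.0, 0.0], [1.0, 0.0]]
print(composite(user_a, user_b, mask))  # → [[200.0, 50.0], [200.0, 50.0]]
```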

Performance Optimization:

  • Client-side throttling to 10 FPS for style transfer requests
  • Frame rate balancing between visual fluidity and server load
  • Efficient base64 encoding for API communication
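
A minimal sketch of the client-side throttle described above, written here in Python for illustration (the project implements this in browser JavaScript): frames arriving faster than the target interval are dropped rather than queued, so the server only ever sees the freshest frame.

```python
import time

class FrameThrottle:
    """Let at most `fps` frames per second through; drop the rest."""

    def __init__(self, fps=10):
        self.min_interval = 1.0 / fps
        self.last_sent = float("-inf")

    def should_send(self, now=None):
        """Return True if enough time has elapsed to send another frame."""
        now = time.monotonic() if now is None else now
        if now - self.last_sent >= self.min_interval:
            self.last_sent = now
            return True
        return False  # drop this frame; a fresher one arrives soon

throttle = FrameThrottle(fps=10)
# t=0.00s: first frame goes out; t=0.05s: too soon; t=0.10s: passes.
print([throttle.should_send(t) for t in (0.00, 0.05, 0.10)])  # → [True, False, True]
```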

Demo

Technical Achievement

Real-Time GAN Inference

Challenge: Standard GAN models are too slow for interactive applications

Solution: Optimized Pix2Pix architecture with reduced resolution and efficient server deployment

Result: 70-90ms processing time with a sustainable 10-15 FPS frame rate for real-time interaction

Multi-User Video Coordination

Challenge: Synchronizing video streams across different devices and varying network conditions

Solution: WebRTC peer-to-peer connections with PeerJS abstraction layer

Result: Stable connections with 300-500ms end-to-end latency and minimal server infrastructure

Browser-Based Computer Vision

Challenge: Running complex segmentation models efficiently in web browsers

Solution: MediaPipe's optimized web assembly implementation for client-side processing

Result: Robust person segmentation across lighting conditions while eliminating server load

Cross-Browser Compatibility

Challenge: WebRTC and canvas inconsistencies across different browsers

Solution: Browser detection with fallback handling and optimized performance for Chrome/Firefox

Result: Consistent user experience with graceful degradation for unsupported features

Performance Optimization

Challenge: Balancing visual quality with real-time performance constraints

Solution: Strategic resolution choices (128×128 processing, 640×480 display) with client-side throttling

Result: Interactive performance achieved; identified 320×240 as optimal production resolution

Results

Future Enhancements

Technical Improvements:

  • Higher Resolution Models: Scale to 320×240 for improved visual quality
  • Depth-Aware Compositing: Implement depth estimation for realistic spatial relationships
  • Lighting Normalization: Match lighting conditions between video streams
  • Temporal Consistency: Reduce frame-to-frame flicker in style transfer

Project Impact

This project demonstrates the practical integration of advanced ML techniques with modern web technologies, showcasing skills in:

  • Generative AI Implementation: From research papers to working applications
  • Real-Time Systems: Managing latency and performance constraints
  • Cross-Platform Development: Browser-based ML deployment

The successful combination of GANs, computer vision, and web technologies creates an interactive experience that pushes the boundaries of what's possible in browser-based machine learning applications.