Table of Contents
Project Overview
The Challenge
Technical Stack
Technical Implementation
Technical Achievements
Future Enhancements
Project Impact
Project Overview
A real-time virtual photo booth that enables two remotely connected users to appear in the same artistic-styled frame using peer-to-peer video streaming and GAN-based style transfer. Built as an end-to-end web application combining computer vision, generative AI, and real-time video processing.
My Role: Solo Full-Stack Developer | Timeline: Spring 2025 Academic Project
The Challenge
Create an interactive multi-user photo booth that goes beyond single-user filters (Instagram, Snapchat) by enabling real-time artistic style transfer for two people in a shared virtual space.
Technical Challenges:
Real-time person segmentation and compositing across different devices
Managing peer-to-peer video streaming with low latency
Implementing GAN-based style transfer fast enough for interactive use
Cross-browser compatibility for WebRTC and canvas operations
Technical Stack
Machine Learning: PyTorch, Pix2Pix GAN, U-Net architecture, PatchGAN discriminator
Computer Vision: MediaPipe, HTML5 Canvas, real-time segmentation
Web Technologies: JavaScript, WebRTC, PeerJS, RESTful APIs
Backend: Flask, base64 image processing, GPU acceleration
Development: Cross-browser testing, performance optimization
Technical Implementation
System Architecture
Frontend: Browser-based interface (HTML5, CSS3, JavaScript)
Video Streaming: WebRTC peer-to-peer connections via the PeerJS library
Backend: Flask server hosting the trained Pix2Pix model behind a RESTful API
Computer Vision: MediaPipe Selfie Segmentation for real-time person extraction
Machine Learning Pipeline
Pix2Pix GAN Implementation:
Generator: U-Net architecture with 7 downsampling layers and skip connections
Discriminator: PatchGAN architecture for texture-focused adversarial training
Training: 150 epochs at a constant learning rate, followed by 50 epochs of linear decay
Loss Function: Combined adversarial loss + L1 loss (λ=100) for structure preservation
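Written out, this is the standard Pix2Pix objective, where x is the input frame and y the target styled image; the L1 term penalizes pixel-level deviation, so the adversarial term only has to supply texture:

```latex
\begin{aligned}
\mathcal{L}_{cGAN}(G, D) &= \mathbb{E}_{x,y}\big[\log D(x, y)\big]
  + \mathbb{E}_{x}\big[\log\big(1 - D(x, G(x))\big)\big] \\
\mathcal{L}_{L1}(G) &= \mathbb{E}_{x,y}\big[\lVert y - G(x) \rVert_{1}\big] \\
G^{*} &= \arg\min_{G}\max_{D}\;
  \mathcal{L}_{cGAN}(G, D) + \lambda\,\mathcal{L}_{L1}(G),
  \qquad \lambda = 100
\end{aligned}
```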
Model Performance:
Resolution: 128×128 processing (proof of concept)
Training Time: ~15 hours on available hardware for a dataset of ~1000 image pairs
Inference Speed: 70-90ms per frame with GPU acceleration
Real-Time Video Processing
Person Segmentation:
MediaPipe Selfie Segmentation for binary mask generation
HTML5 Canvas compositing operations (destination-out, source-over)
Robust performance across lighting conditions and subject variations
Video Compositing Pipeline:
User A Video → Person Segmentation → Silhouette Extraction → Composite with User B → Style Transfer → Real-time Display
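A condensed sketch of the segmentation-and-compositing stage of this pipeline (style transfer runs server-side and is covered below). It assumes the @mediapipe/selfie_segmentation package; the element ids and the source-in / destination-over ordering are illustrative, not taken from the project (the writeup names destination-out / source-over, one of several equivalent arrangements):

```javascript
import { SelfieSegmentation } from '@mediapipe/selfie_segmentation';

const localVideo = document.getElementById('localVideo');   // User A camera
const remoteVideo = document.getElementById('remoteVideo'); // User B via WebRTC
const canvas = document.getElementById('compositeCanvas');
const ctx = canvas.getContext('2d');

const segmenter = new SelfieSegmentation({
  locateFile: (f) => `https://cdn.jsdelivr.net/npm/@mediapipe/selfie_segmentation/${f}`,
});
segmenter.setOptions({ modelSelection: 1 }); // landscape model

segmenter.onResults((results) => {
  ctx.save();
  ctx.clearRect(0, 0, canvas.width, canvas.height);

  // 1. Draw the person mask, then keep only the local-frame pixels
  //    that overlap it -- the silhouette extraction step.
  ctx.drawImage(results.segmentationMask, 0, 0, canvas.width, canvas.height);
  ctx.globalCompositeOperation = 'source-in';
  ctx.drawImage(results.image, 0, 0, canvas.width, canvas.height);

  // 2. Paint the remote user's frame behind the silhouette so both
  //    people share one scene.
  ctx.globalCompositeOperation = 'destination-over';
  ctx.drawImage(remoteVideo, 0, 0, canvas.width, canvas.height);
  ctx.restore();
});

// Pump camera frames through the segmenter on each animation tick.
async function loop() {
  await segmenter.send({ image: localVideo });
  requestAnimationFrame(loop);
}
loop();
```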
Performance Optimization:
Client-side throttling to 10 FPS for style transfer requests
Frame rate balancing between visual fluidity and server load
Efficient base64 encoding for API communication
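A minimal sketch of the 10 FPS throttle on style-transfer requests. The endpoint path (/stylize) and the JSON shape are assumptions; the writeup only states that frames travel as base64 over a REST API:

```javascript
const canvas = document.getElementById('compositeCanvas');
const styledOutput = document.getElementById('styledOutput'); // <img> element

const STYLE_INTERVAL_MS = 100; // 1000 ms / 10 FPS
let inFlight = false;

setInterval(async () => {
  if (inFlight) return; // drop frames rather than queue behind a slow response
  inFlight = true;
  try {
    // JPEG keeps the base64 payload small for the 128x128 model input.
    const frame = canvas.toDataURL('image/jpeg', 0.8);
    const res = await fetch('/stylize', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ image: frame }),
    });
    const { styled } = await res.json();
    styledOutput.src = styled; // server returns the stylized frame as base64
  } finally {
    inFlight = false;
  }
}, STYLE_INTERVAL_MS);
```

Skipping frames while a request is in flight keeps client frame rate and server load decoupled: a slow inference pass degrades smoothness rather than building an unbounded request queue.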
Technical Achievements
Real-Time GAN Inference
Challenge: Standard GAN models are too slow for interactive applications
Solution: Optimized Pix2Pix architecture with reduced resolution and efficient server deployment
Result: 70-90ms per-frame processing with a sustainable 10-15 fps frame rate for real-time interaction
Multi-User Video Coordination
Challenge: Synchronizing video streams between different devices and network conditions
Solution: WebRTC peer-to-peer connections with PeerJS abstraction layer
Result: Stable connections with 300-500ms end-to-end latency and minimal server infrastructure
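A minimal PeerJS handshake sketch, assuming the public PeerJS cloud broker and an out-of-band exchange of peer ids (how ids are shared is not described in the writeup):

```javascript
import Peer from 'peerjs';

const remoteVideo = document.getElementById('remoteVideo');
const peer = new Peer(); // the broker assigns this client an id
peer.on('open', (id) => console.log('share this id with the other user:', id));

async function init() {
  const localStream = await navigator.mediaDevices.getUserMedia({
    video: true,
    audio: false,
  });

  // Callee side: answer incoming calls with our own camera stream.
  peer.on('call', (call) => {
    call.answer(localStream);
    call.on('stream', (s) => { remoteVideo.srcObject = s; });
  });

  // Caller side: dial the other user's id once it is known
  // (exposed globally here purely for illustration).
  window.joinBooth = (remoteId) => {
    const call = peer.call(remoteId, localStream);
    call.on('stream', (s) => { remoteVideo.srcObject = s; });
  };
}
init();
```

PeerJS hides the SDP offer/answer and ICE negotiation behind call/answer, which is why only a lightweight signaling broker is needed rather than a media server.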
Browser-Based Computer Vision
Challenge: Running complex segmentation models efficiently in web browsers
Solution: MediaPipe's optimized WebAssembly implementation for client-side processing
Result: Robust person segmentation across lighting conditions with no added server load, since inference runs entirely client-side
Cross-Browser Compatibility
Challenge: WebRTC and canvas inconsistencies across different browsers
Solution: Browser detection with fallback handling and optimized performance for Chrome/Firefox
Result: Consistent user experience with graceful degradation for unsupported features
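An illustrative capability check of the kind described; the exact checks and the fallback behavior (showFallbackUI) are hypothetical:

```javascript
// Feature-detect the APIs the booth depends on before starting.
function supportsBooth() {
  const hasWebRTC = Boolean(
    window.RTCPeerConnection && navigator.mediaDevices?.getUserMedia
  );
  const hasCanvas = Boolean(document.createElement('canvas').getContext('2d'));
  return hasWebRTC && hasCanvas;
}

if (!supportsBooth()) {
  showFallbackUI(); // e.g. a static-photo mode instead of live video
}
```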
Performance Optimization
Challenge: Balancing visual quality with real-time performance constraints
Solution: Strategic resolution choices (128×128 processing, 640×480 display) with client-side throttling
Result: Interactive performance achieved; identified 320×240 as optimal production resolution
Future Enhancements
Technical Improvements:
Higher Resolution Models: Scale to 320×240 for improved visual quality
Depth-Aware Compositing: Implement depth estimation for realistic spatial relationships
Lighting Normalization: Match lighting conditions between video streams
Temporal Consistency: Reduce frame-to-frame flicker in style transfer
Project Impact
This project demonstrates the practical integration of advanced ML techniques with modern web technologies, showcasing skills in:
Generative AI Implementation: From research papers to working applications
Real-Time Systems: Managing latency and performance constraints
Cross-Platform Development: Browser-based ML deployment
The successful combination of GANs, computer vision, and web technologies creates an interactive experience that pushes the boundaries of what's possible in browser-based machine learning applications.