Table of Contents

  • Project Overview 

  • The Challenge

  • Technical Stack

  • Technical Implementation

  • Technical Achievements

  • Future Enhancements

  • Project Impact

Project Overview

A real-time virtual photo booth that enables two remotely connected users to appear in the same artistic-styled frame using peer-to-peer video streaming and GAN-based style transfer. Built as an end-to-end web application combining computer vision, generative AI, and real-time video processing.

My Role: Solo Full-Stack Developer | Timeline: Spring 2025 Academic Project

The Challenge

Create an interactive multi-user photo booth that goes beyond single-user filters (Instagram, Snapchat) by enabling real-time artistic style transfer for two people in a shared virtual space.

Technical Challenges:

  • Real-time person segmentation and compositing across different devices

  • Managing peer-to-peer video streaming with low latency

  • Implementing GAN-based style transfer fast enough for interactive use

  • Cross-browser compatibility for WebRTC and canvas operations

Technical Stack

Machine Learning: PyTorch, Pix2Pix GAN, U-Net architecture, PatchGAN discriminator

Computer Vision: MediaPipe, HTML5 Canvas, real-time segmentation

Web Technologies: JavaScript, WebRTC, PeerJS, RESTful APIs

Backend: Flask, base64 image processing, GPU acceleration

Development: Cross-browser testing, performance optimization

Technical Implementation

System Architecture

Frontend: Browser-based interface (HTML5, CSS3, JavaScript)

Video Streaming: WebRTC peer-to-peer connections via the PeerJS library

Backend: Flask server hosting the trained Pix2Pix model behind a RESTful API

Computer Vision: MediaPipe Selfie Segmentation for real-time person extraction
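Frames travel between the browser and the Flask API as base64 strings inside JSON. A minimal sketch of the decode/encode helpers such an endpoint would use (the function names are illustrative, not taken from the project):

```python
import base64

def decode_frame(data_url: str) -> bytes:
    """Decode a frame posted by the browser.

    Strips an optional ``data:image/...;base64,`` prefix before decoding,
    so both raw base64 and data-URL payloads are accepted.
    """
    payload = data_url.split(",")[-1]
    return base64.b64decode(payload)

def encode_frame(raw: bytes) -> str:
    """Encode styled-frame bytes for the JSON response."""
    return base64.b64encode(raw).decode("ascii")
```

The styled frame comes back the same way, so the browser can drop it straight into an `<img>` or canvas via a data URL.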

Machine Learning Pipeline

Pix2Pix GAN Implementation:

  • Generator: U-Net architecture with 7 downsampling layers and skip connections

  • Discriminator: PatchGAN architecture for texture-focused adversarial training

  • Training: 150 epochs at a constant learning rate, then 50 epochs of linear decay

  • Loss Function: Combined adversarial loss + L1 loss (λ=100) for structure preservation
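The generator objective and learning-rate schedule above can be sketched in NumPy (a stand-in for the actual PyTorch training code, which is not shown here; the probabilities stand for PatchGAN output patches):

```python
import numpy as np

LAMBDA_L1 = 100.0  # L1 weight from the Pix2Pix paper

def bce(pred, target):
    """Binary cross-entropy over discriminator patch probabilities."""
    eps = 1e-7
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)))

def generator_loss(patch_probs_on_fake, fake_img, real_img):
    """Adversarial term (fool the discriminator) + weighted L1 term."""
    adv = bce(patch_probs_on_fake, np.ones_like(patch_probs_on_fake))
    l1 = float(np.mean(np.abs(fake_img - real_img)))  # structure preservation
    return adv + LAMBDA_L1 * l1

def lr_scale(epoch, n_const=150, n_decay=50):
    """Constant LR for the first 150 epochs, then linear decay to zero over 50."""
    if epoch < n_const:
        return 1.0
    return max(0.0, 1.0 - (epoch - n_const) / n_decay)
```

The large L1 weight (λ=100) is what keeps the generated frame structurally faithful to the input; the adversarial term only sharpens texture.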

Model Performance:

  • Resolution: 128×128 processing (proof of concept)

  • Training Time: 15 hours on available hardware (~1000 image pairs)

  • Inference Speed: 70-90ms per frame with GPU acceleration

Real-Time Video Processing

Person Segmentation:

  • MediaPipe Selfie Segmentation for binary mask generation

  • HTML5 Canvas compositing operations (destination-out, source-over)

  • Robust performance across lighting conditions and subject variations
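The canvas compositing steps have a direct NumPy analogue; a sketch of the mask-based blend (the project itself does this client-side with destination-out / source-over canvas operations, so this is illustrative only):

```python
import numpy as np

def composite(frame_a, frame_b, mask_a):
    """Paste User A's silhouette onto User B's frame.

    mask_a: float array in [0, 1]; the segmentation confidence map,
    ~1 where a person is detected. Equivalent to cutting A out of its
    background and drawing the silhouette over B's frame.
    """
    m = mask_a[..., None]  # broadcast the mask over the RGB channels
    return (m * frame_a + (1.0 - m) * frame_b).astype(np.uint8)
```

Using the raw confidence map instead of a hard 0/1 threshold gives softly feathered silhouette edges for free.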

Video Compositing Pipeline:

User A Video → Person Segmentation → Silhouette Extraction → Composite with User B → Style Transfer → Real-time Display

Performance Optimization:

  • Client-side throttling to 10 FPS for style transfer requests

  • Frame rate balancing between visual fluidity and server load

  • Efficient base64 encoding for API communication
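The 10 FPS throttle amounts to a simple interval gate on outgoing requests (the project's client is JavaScript; this Python version is an illustrative sketch):

```python
import time

class FrameThrottle:
    """Forward at most max_fps frames to the style-transfer endpoint."""

    def __init__(self, max_fps=10):
        self.min_interval = 1.0 / max_fps
        self.last_sent = float("-inf")  # no frame sent yet

    def should_send(self, now=None):
        """Return True (and record the timestamp) if a frame may be sent."""
        now = time.monotonic() if now is None else now
        if now - self.last_sent >= self.min_interval:
            self.last_sent = now
            return True
        return False  # drop this frame; the next capture will retry
```

Dropped frames are simply never sent, so the server sees a steady request rate regardless of the camera's native frame rate.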

Technical Achievements

Real-Time GAN Inference

Challenge: Standard GAN models are too slow for interactive applications

Solution: Optimized Pix2Pix architecture with reduced resolution and efficient server deployment

Result: Achieved 70-90ms processing time with 10-15 fps sustainable frame rate for real-time interaction
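The 10-15 fps figure follows directly from the per-frame latency; a quick sanity check (illustrative helper, not project code):

```python
def max_fps(latency_ms: float) -> float:
    """Upper bound on frame rate when each frame blocks on inference."""
    return 1000.0 / latency_ms

max_fps(70)  # ~14.3 fps at the fast end of the 70-90ms range
max_fps(90)  # ~11.1 fps at the slow end
```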

Multi-User Video Coordination

Challenge: Synchronizing video streams between different devices and network conditions

Solution: WebRTC peer-to-peer connections with PeerJS abstraction layer

Result: Stable connections with 300-500ms end-to-end latency and minimal server infrastructure

Browser-Based Computer Vision

Challenge: Running complex segmentation models efficiently in web browsers

Solution: MediaPipe's optimized web assembly implementation for client-side processing

Result: Robust person segmentation across lighting conditions while eliminating server load

Cross-Browser Compatibility

Challenge: WebRTC and canvas inconsistencies across different browsers

Solution: Browser detection with fallback handling and optimized performance for Chrome/Firefox

Result: Consistent user experience with graceful degradation for unsupported features

Performance Optimization

Challenge: Balancing visual quality with real-time performance constraints

Solution: Strategic resolution choices (128×128 processing, 640×480 display) with client-side throttling

Result: Interactive performance achieved; identified 320×240 as optimal production resolution
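Since inference cost scales roughly with pixel count, the resolution trade-off can be quantified with a back-of-the-envelope helper (illustrative only):

```python
def relative_cost(w: int, h: int, base=(128, 128)) -> float:
    """Per-frame pixel count relative to the 128x128 proof of concept."""
    return (w * h) / (base[0] * base[1])

relative_cost(320, 240)  # ~4.7x the 128x128 workload
relative_cost(640, 480)  # 18.75x -- why full display resolution is out of reach
```

This is why 320×240 lands as the sweet spot: a visible quality gain at a cost that still fits the 70-90ms inference budget, while 640×480 would not.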

Future Enhancements

Technical Improvements:

  • Higher Resolution Models: Scale to 320×240 for improved visual quality

  • Depth-Aware Compositing: Implement depth estimation for realistic spatial relationships

  • Lighting Normalization: Match lighting conditions between video streams

  • Temporal Consistency: Reduce frame-to-frame flicker in style transfer

Project Impact

This project demonstrates the practical integration of advanced ML techniques with modern web technologies, showcasing skills in:

  • Generative AI Implementation: From research papers to working applications

  • Real-Time Systems: Managing latency and performance constraints

  • Cross-Platform Development: Browser-based ML deployment

The successful combination of GANs, computer vision, and web technologies creates an interactive experience that pushes the boundaries of what's possible in browser-based machine learning applications.
