Streamect
A real-time wearable vision and audio intelligence system built on Meta smart glasses at TartanHacks.
Overview
Streamect is a real-time wearable intelligence system built at TartanHacks that turns Ray-Ban Meta smart glasses into a live sensing platform for computer vision, speech transcription, and AI-powered interaction analysis. The target use case is sales professionals in high-touch environments such as luxury retail, consulting, and financial advising: safe, consensual face and gesture tracking evaluates employee interactions and surfaces personalized customer context, like birthdays or past-visit notes, to help close sales.
The Pipeline
Since the Meta glasses don't expose a developer-friendly raw camera stream, we engineered a custom low-latency pipeline: the live POV feed routes through a WhatsApp video call, gets captured in OBS Studio as a virtual webcam, and enters Python via OpenCV as real-time NumPy arrays. That gives us frame-by-frame programmatic access to first-person video — the foundation for everything else. Audio follows a parallel path using Web Audio API nodes to mix the wearer's voice and ambient room audio, downsample to 16kHz PCM, and stream chunks to Azure Speech for near-real-time transcription.
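The audio leg runs in the browser via Web Audio API nodes, but the mix-downsample-chunk idea can be sketched in Python with NumPy. This is an illustrative sketch, not the project's actual code: the 48 kHz source rate, averaging mix, and 100 ms chunk size are assumptions.

```python
import numpy as np

AZURE_RATE = 16_000   # Azure Speech expects 16 kHz mono PCM
SOURCE_RATE = 48_000  # typical browser capture rate (assumed)

def to_speech_chunks(mic: np.ndarray, room: np.ndarray, chunk_ms: int = 100):
    """Mix two float32 [-1, 1] streams, decimate 48 kHz -> 16 kHz,
    convert to 16-bit PCM, and yield fixed-size byte chunks."""
    mixed = np.clip((mic + room) * 0.5, -1.0, 1.0)   # simple average mix
    mono16k = mixed[:: SOURCE_RATE // AZURE_RATE]    # naive decimation
    # (a production pipeline would low-pass filter before decimating)
    pcm = (mono16k * 32767).astype(np.int16)         # 16-bit PCM samples
    samples_per_chunk = AZURE_RATE * chunk_ms // 1000
    for i in range(0, len(pcm), samples_per_chunk):
        yield pcm[i : i + samples_per_chunk].tobytes()

# one second of audio from each source -> ten 100 ms byte chunks
chunks = list(to_speech_chunks(np.zeros(48_000, np.float32),
                               np.zeros(48_000, np.float32)))
```

Each yielded chunk is raw PCM bytes ready to push into a streaming recognizer's audio input.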
Computer Vision
On top of the video feed, we run MediaPipe's 478-point Face Landmarker and GestureRecognizer in real time. Raw detections are smoothed frame-to-frame with lightweight identity tracking so faces maintain stable bounding boxes and consistent IDs instead of flickering. Azure Face API handles the recognition side — person creation, group membership, persisted faces, and identification — so the system can move beyond generic detection into actually recognizing returning customers.
Full-Stack System
The backend connects live media to Azure SQL for persistent storage, an in-memory cache for responsiveness, and Azure OpenAI for conversation summarization. WebRTC handles the streaming layer with WebSocket signaling for session negotiation. The whole thing is wrapped in a Next.js 15 app with REST routes for face operations, recordings, transcription jobs, and summaries — plus a processing UI for profile management, identity merge/split, and per-conversation playback with aligned transcripts.
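The cache layer is conceptually a read-through TTL map in front of Azure SQL: hot reads come from memory, misses and stale entries fall back to the database. The backend itself is Next.js; this Python sketch just illustrates the pattern, with a hypothetical loader and an illustrative TTL:

```python
import time

class TTLCache:
    """Tiny read-through cache: serve recent reads from memory, call a
    loader (e.g. an Azure SQL query) on a miss or an expired entry."""
    def __init__(self, loader, ttl=30.0):
        self.loader, self.ttl = loader, ttl
        self.store = {}                       # key -> (expires_at, value)

    def get(self, key):
        hit = self.store.get(key)
        if hit and hit[0] > time.monotonic():
            return hit[1]                     # fresh: skip the database
        value = self.loader(key)              # miss/stale: reload
        self.store[key] = (time.monotonic() + self.ttl, value)
        return value

loads = []  # record which keys actually hit the "database"
cache = TTLCache(lambda k: loads.append(k) or f"profile:{k}", ttl=0.05)
first = cache.get("cust-1")    # cold: invokes the loader
second = cache.get("cust-1")   # warm: served from memory
```

A real deployment would also bound the cache size and invalidate on writes, but the responsiveness win comes from this read path.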
What I Learned
The biggest lesson was that real-world systems work happens in the seams between technologies, not inside the models. Getting the glasses stream into code required debugging OBS transforms, Windows device indexing, and resolution integrity before any AI layer could start. A small issue in capture setup or audio routing breaks everything downstream. Low-latency multimodal systems are only as strong as their weakest integration point — and making messy real-world inputs usable for intelligence is the actual hard problem.