Streamect
A real-time wearable vision and audio intelligence system built on Meta smart glasses at TartanHacks.
Overview
Streamect is a real-time wearable intelligence system built at TartanHacks that turns Ray-Ban Meta smart glasses into a live sensing platform for computer vision, speech transcription, and AI-powered interaction analysis. The target use case is sales professionals in high-touch environments such as luxury retail, consulting, and financial advising: safe, consensual face and gesture tracking evaluates employee interactions and surfaces personalized customer context, like birthdays or past-visit notes, to help close sales.
The Pipeline
Since the Meta glasses don't expose a developer-friendly raw camera stream, we engineered a custom low-latency pipeline: the live POV feed routes through a WhatsApp video call, gets captured in OBS Studio as a virtual webcam, and enters Python via OpenCV as real-time NumPy arrays. That gives us frame-by-frame programmatic access to first-person video — the foundation for everything else. Audio follows a parallel path using Web Audio API nodes to mix the wearer's voice and ambient room audio, downsample to 16kHz PCM, and stream chunks to Azure Speech for near-real-time transcription.
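The audio leg runs in the browser via Web Audio API nodes, but the mix-downsample-chunk idea can be sketched in Python with NumPy. This is an illustrative sketch, not the project's actual code: the 48 kHz source rate, averaging mix, and 100 ms chunk size are assumptions.

```python
import numpy as np

AZURE_RATE = 16_000   # Azure Speech expects 16 kHz mono PCM
SOURCE_RATE = 48_000  # typical browser capture rate (assumed)

def to_speech_chunks(mic: np.ndarray, room: np.ndarray, chunk_ms: int = 100):
    """Mix two float32 [-1, 1] streams, decimate 48 kHz -> 16 kHz,
    convert to 16-bit PCM, and yield fixed-size byte chunks."""
    mixed = np.clip((mic + room) * 0.5, -1.0, 1.0)   # simple average mix
    mono16k = mixed[:: SOURCE_RATE // AZURE_RATE]    # naive decimation
    # (a production pipeline would low-pass filter before decimating)
    pcm = (mono16k * 32767).astype(np.int16)         # 16-bit PCM samples
    samples_per_chunk = AZURE_RATE * chunk_ms // 1000
    for i in range(0, len(pcm), samples_per_chunk):
        yield pcm[i : i + samples_per_chunk].tobytes()

# one second of audio from each source -> ten 100 ms byte chunks
chunks = list(to_speech_chunks(np.zeros(48_000, np.float32),
                               np.zeros(48_000, np.float32)))
```

Each yielded chunk is raw PCM bytes ready to push into a streaming recognizer's audio input.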
Computer Vision
On top of the video feed, we run MediaPipe's 478-point Face Landmarker and GestureRecognizer in real time. Raw detections are smoothed frame-to-frame with lightweight identity tracking so faces maintain stable bounding boxes and consistent IDs instead of flickering. Azure Face API handles the recognition side — person creation, group membership, persisted faces, and identification — so the system can move beyond generic detection into actually recognizing returning customers.
Full-Stack System
The backend connects live media to Azure SQL for persistent storage, an in-memory cache for responsiveness, and Azure OpenAI for conversation summarization. WebRTC handles the streaming layer with WebSocket signaling for session negotiation. The whole thing is wrapped in a Next.js 15 app with REST routes for face operations, recordings, transcription jobs, and summaries — plus a processing UI for profile management, identity merge/split, and per-conversation playback with aligned transcripts.
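The cache layer is conceptually a read-through TTL map in front of Azure SQL: hot reads come from memory, misses and stale entries fall back to the database. The backend itself is Next.js; this Python sketch just illustrates the pattern, with a hypothetical loader and an illustrative TTL:

```python
import time

class TTLCache:
    """Tiny read-through cache: serve recent reads from memory, call a
    loader (e.g. an Azure SQL query) on a miss or an expired entry."""
    def __init__(self, loader, ttl=30.0):
        self.loader, self.ttl = loader, ttl
        self.store = {}                       # key -> (expires_at, value)

    def get(self, key):
        hit = self.store.get(key)
        if hit and hit[0] > time.monotonic():
            return hit[1]                     # fresh: skip the database
        value = self.loader(key)              # miss/stale: reload
        self.store[key] = (time.monotonic() + self.ttl, value)
        return value

loads = []  # record which keys actually hit the "database"
cache = TTLCache(lambda k: loads.append(k) or f"profile:{k}", ttl=0.05)
first = cache.get("cust-1")    # cold: invokes the loader
second = cache.get("cust-1")   # warm: served from memory
```

A real deployment would also bound the cache size and invalidate on writes, but the responsiveness win comes from this read path.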
What I Learned
The biggest lesson was that real-world systems work happens in the seams between technologies, not inside the models. Getting the glasses stream into code required debugging OBS transforms, Windows device indexing, and resolution integrity before any AI layer could start. A small issue in capture setup or audio routing breaks everything downstream. Low-latency multimodal systems are only as strong as their weakest integration point — and making messy real-world inputs usable for intelligence is the actual hard problem.