Side-by-side view: webcam with hand tracking (left) + Reachy Mini robot simulation with answer overlays (right)
Seven-Stage Pipeline
Raw webcam pixels become a spoken robot response in under 3 seconds. Everything runs locally.
Hand Tracking
MediaPipe HandLandmarker
Extracts 21 3D keypoints per hand at 30 FPS, covering every joint from wrist to fingertip and giving precise spatial data for classification.
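MediaPipe's documented hand layout puts the wrist at index 0, then four joints per finger ending at the fingertip. A minimal sketch of indexing into one 21-keypoint frame (the dummy frame below stands in for real HandLandmarker output):

```python
# MediaPipe's documented 21-landmark hand layout: index 0 is the wrist,
# then four joints per finger, ending at the fingertip.
WRIST = 0
FINGERTIPS = {"thumb": 4, "index": 8, "middle": 12, "ring": 16, "pinky": 20}
PIP_JOINTS = {"index": 6, "middle": 10, "ring": 14, "pinky": 18}

def fingertip_positions(landmarks):
    """Pick the five fingertip (x, y, z) triples out of a 21-landmark frame."""
    return {name: landmarks[i] for name, i in FINGERTIPS.items()}

# Dummy frame: 21 keypoints as (x, y, z), shaped like HandLandmarker output.
frame = [(i * 0.01, i * 0.02, 0.0) for i in range(21)]
tips = fingertip_positions(frame)
```

The PIP joint indices matter for the next stage, where joint angles feed the extension check.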
Sign Classification
Geometric Heuristics
Dual-check finger detection: palm-center distance ratio (1.2x threshold) + PIP joint angle (150° threshold). Requiring both checks to pass cuts the false positives of single-metric approaches.
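The dual check can be sketched as follows. The exact geometry here is an assumption (tip-to-palm distance over PIP-to-palm distance for the ratio; angle measured at the PIP joint, where 180° is a perfectly straight finger):

```python
import math

def pip_angle(mcp, pip, tip):
    """Angle at the PIP joint in degrees (180 = perfectly straight finger)."""
    v1 = [m - p for m, p in zip(mcp, pip)]
    v2 = [t - p for t, p in zip(tip, pip)]
    dot = sum(a * b for a, b in zip(v1, v2))
    cos = dot / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def finger_extended(palm_center, mcp, pip, tip,
                    ratio_threshold=1.2, angle_threshold=150.0):
    """Dual check: the tip must sit 1.2x farther from the palm center than
    the PIP joint, AND the PIP joint must be straighter than 150 degrees."""
    ratio_ok = math.dist(tip, palm_center) > ratio_threshold * math.dist(pip, palm_center)
    angle_ok = pip_angle(mcp, pip, tip) > angle_threshold
    return ratio_ok and angle_ok
```

A straight finger passes both checks; a curled one fails both, and a half-curled finger that fools one metric still fails the other.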
Gloss Sequencing
Temporal Buffer
Hold-time validation (0.4s), duplicate removal, and timeout detection (2s) turn noisy per-frame detections into clean sign sequences.
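The buffering logic above can be sketched as a small state machine. Class and method names here are illustrative, not the project's actual code; timestamps are passed in explicitly so the logic can be driven by any clock:

```python
class GlossBuffer:
    """Turns noisy per-frame sign detections into a clean gloss sequence.

    A sign is committed only after being held HOLD_TIME seconds; consecutive
    duplicates are dropped; TIMEOUT seconds of silence ends the sequence.
    """
    HOLD_TIME = 0.4
    TIMEOUT = 2.0

    def __init__(self):
        self.sequence = []
        self._candidate = None
        self._candidate_since = None
        self._last_commit = None

    def update(self, sign, now):
        """Feed one per-frame detection; returns the finished sequence on timeout."""
        # Timeout: a long gap after the last committed sign ends the sequence.
        if self._last_commit is not None and now - self._last_commit > self.TIMEOUT:
            done, self.sequence = self.sequence, []
            self._last_commit = None
            self._candidate = None
            return done
        if sign != self._candidate:
            # New candidate: start its hold timer.
            self._candidate, self._candidate_since = sign, now
        elif sign is not None and now - self._candidate_since >= self.HOLD_TIME:
            # Held long enough; commit unless it duplicates the last sign.
            if not self.sequence or self.sequence[-1] != sign:
                self.sequence.append(sign)
                self._last_commit = now
        return None
```

Driving it with fake timestamps shows the flow: two held signs commit, then two seconds of silence flushes the sequence downstream.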
Gloss Translation
Glossa-BART (Seq2Seq Transformer)
ASL gloss is not English: word order differs and function words are dropped. A fine-tuned BART model with beam search translates gloss into natural English sentences.
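Beam search keeps the k best partial hypotheses at each decoding step instead of greedily committing to the single best token. A toy sketch of the algorithm with a hand-written next-token scorer (the real pipeline uses BART's learned probabilities, not this table):

```python
import math

def beam_search(score_next, start, steps, beam_width=3):
    """Generic beam search: keep the beam_width best hypotheses per step.

    score_next(tokens) -> list of (next_token, log_prob) candidates.
    """
    beams = [(0.0, [start])]  # (cumulative log-prob, token sequence)
    for _ in range(steps):
        candidates = []
        for logp, tokens in beams:
            for tok, tok_logp in score_next(tokens):
                candidates.append((logp + tok_logp, tokens + [tok]))
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:beam_width]
    return beams[0][1]

# Toy bigram "language model" standing in for BART's decoder.
toy_lm = {
    "<s>": [("what", math.log(0.6)), ("your", math.log(0.4))],
    "what": [("is", math.log(0.9)), ("name", math.log(0.1))],
    "is": [("your", math.log(0.8)), ("the", math.log(0.2))],
    "your": [("name", math.log(0.7)), ("what", math.log(0.3))],
    "name": [("</s>", math.log(1.0))],
    "the": [("name", math.log(1.0))],
    "</s>": [("</s>", math.log(1.0))],
}
best = beam_search(lambda toks: toy_lm[toks[-1]], "<s>", steps=5)
```

Greedy decoding can lock onto a locally likely token and miss the globally best sentence; the beam defers that commitment by a few steps.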
Intent Classification
Sentence-Transformers (all-MiniLM-L6-v2)
384-dimensional embeddings + cosine similarity classify questions into 5 intent categories: identity, time, date, location, general.
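Nearest-prototype classification by cosine similarity can be sketched with a stand-in embedder. The bag-of-words `embed` below and the prototype sentences are placeholders; the real pipeline uses all-MiniLM-L6-v2's 384-dimensional sentence embeddings:

```python
import math
from collections import Counter

# Illustrative prototype sentences, one per intent category.
INTENT_PROTOTYPES = {
    "identity": "who are you what is your name",
    "time": "what time is it",
    "date": "what day is today what is the date",
    "location": "where are you where is this place",
    "general": "tell me something",
}

def embed(text):
    """Stand-in embedding: bag-of-words counts instead of MiniLM vectors."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify_intent(sentence):
    """Pick the intent whose prototype is most similar to the sentence."""
    query = embed(sentence)
    return max(INTENT_PROTOTYPES,
               key=lambda k: cosine(query, embed(INTENT_PROTOTYPES[k])))
```

Swapping `embed` for a real sentence encoder leaves the classifier unchanged, which is the appeal of the prototype approach: adding an intent is just adding a row to the table.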
Answer Generation
Ollama + Llama 3.2 3B
Local LLM generates contextual answers with intent-aware prompts. Capped at 2 sentences for natural spoken output. No cloud API needed.
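Ollama exposes a local REST endpoint, `POST /api/generate`, that takes a JSON body with the model name and prompt. A sketch of building that request with an intent-aware prompt (the prompt wording here is illustrative, not the project's actual prompts):

```python
import json

# Intent-aware prompt prefixes (illustrative wording).
INTENT_PROMPTS = {
    "identity": "You are Reachy Mini, a friendly desktop robot.",
    "time": "Answer with the current time in one short sentence.",
    "general": "Answer helpfully and concisely.",
}

def build_ollama_request(question, intent):
    """Build the JSON body for Ollama's POST /api/generate endpoint."""
    prefix = INTENT_PROMPTS.get(intent, INTENT_PROMPTS["general"])
    prompt = f"{prefix}\nQuestion: {question}\nAnswer in at most 2 sentences."
    return json.dumps({
        "model": "llama3.2:3b",
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": 80},  # keep spoken answers short
    })
```

POSTing this body to `http://localhost:11434/api/generate` returns the answer with no cloud round-trip.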
Robot Response
Reachy Mini SDK + MuJoCo + TTS
Reachy Mini nods and wiggles its antennas, MuJoCo renders the simulation, and TTS speaks the answer aloud. All three response channels fire simultaneously.
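Firing three channels at once is a plain fan-out-and-join. A minimal sketch with threads, where the lambdas stand in for the actual SDK, simulation, and TTS calls (none of the channel names below are the Reachy Mini SDK's real API):

```python
import threading

def respond(answer, channels):
    """Run every response channel concurrently and wait for all to finish.

    `channels` is a list of callables taking the answer text: e.g. a robot
    motion routine, a simulation update, and TTS playback (stand-ins here).
    """
    threads = [threading.Thread(target=ch, args=(answer,)) for ch in channels]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Dummy channels that just record what they received.
log = []
respond("I am Reachy Mini.", [
    lambda a: log.append(("motion", a)),
    lambda a: log.append(("sim", a)),
    lambda a: log.append(("tts", a)),
])
```

Joining on all three keeps the pipeline from accepting the next question while the robot is still mid-nod or mid-sentence.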