Side-by-side view: webcam with hand tracking (left) + Reachy Mini robot simulation with answer overlays (right)
Seven-Stage Pipeline
Raw webcam pixels become a spoken robot response in under 3 seconds. Everything runs locally.
Hand Tracking
MediaPipe HandLandmarker
Extracts 21 3D keypoints per hand at 30 FPS, covering every joint from wrist to fingertip and giving precise spatial data for classification.
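MediaPipe's documented hand layout puts the wrist at index 0, then four joints per finger ending at the fingertip. A minimal sketch of indexing into one 21-keypoint frame (the dummy frame below stands in for real HandLandmarker output):

```python
# MediaPipe's documented 21-landmark hand layout: index 0 is the wrist,
# then four joints per finger, ending at the fingertip.
WRIST = 0
FINGERTIPS = {"thumb": 4, "index": 8, "middle": 12, "ring": 16, "pinky": 20}
PIP_JOINTS = {"index": 6, "middle": 10, "ring": 14, "pinky": 18}

def fingertip_positions(landmarks):
    """Pick the five fingertip (x, y, z) triples out of a 21-landmark frame."""
    return {name: landmarks[i] for name, i in FINGERTIPS.items()}

# Dummy frame: 21 keypoints as (x, y, z), shaped like HandLandmarker output.
frame = [(i * 0.01, i * 0.02, 0.0) for i in range(21)]
tips = fingertip_positions(frame)
```

The PIP joint indices matter for the next stage, where joint angles feed the extension check.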
Sign Classification
Geometric Heuristics
Dual-check finger detection: palm-center distance ratio (1.2x threshold) + PIP joint angle (150° threshold). Requiring both checks to pass cuts the false positives of single-metric approaches.
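The dual check can be sketched as follows. The exact geometry here is an assumption (tip-to-palm distance over PIP-to-palm distance for the ratio; angle measured at the PIP joint, where 180° is a perfectly straight finger):

```python
import math

def pip_angle(mcp, pip, tip):
    """Angle at the PIP joint in degrees (180 = perfectly straight finger)."""
    v1 = [m - p for m, p in zip(mcp, pip)]
    v2 = [t - p for t, p in zip(tip, pip)]
    dot = sum(a * b for a, b in zip(v1, v2))
    cos = dot / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def finger_extended(palm_center, mcp, pip, tip,
                    ratio_threshold=1.2, angle_threshold=150.0):
    """Dual check: the tip must sit 1.2x farther from the palm center than
    the PIP joint, AND the PIP joint must be straighter than 150 degrees."""
    ratio_ok = math.dist(tip, palm_center) > ratio_threshold * math.dist(pip, palm_center)
    angle_ok = pip_angle(mcp, pip, tip) > angle_threshold
    return ratio_ok and angle_ok
```

A straight finger passes both checks; a curled one fails both, and a half-curled finger that fools one metric still fails the other.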
Gloss Sequencing
Temporal Buffer
Hold-time validation (0.4s), duplicate removal, and timeout detection (2s) turn noisy per-frame detections into clean sign sequences.
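The buffering logic above can be sketched as a small state machine. Class and method names here are illustrative, not the project's actual code; timestamps are passed in explicitly so the logic can be driven by any clock:

```python
class GlossBuffer:
    """Turns noisy per-frame sign detections into a clean gloss sequence.

    A sign is committed only after being held HOLD_TIME seconds; consecutive
    duplicates are dropped; TIMEOUT seconds of silence ends the sequence.
    """
    HOLD_TIME = 0.4
    TIMEOUT = 2.0

    def __init__(self):
        self.sequence = []
        self._candidate = None
        self._candidate_since = None
        self._last_commit = None

    def update(self, sign, now):
        """Feed one per-frame detection; returns the finished sequence on timeout."""
        # Timeout: a long gap after the last committed sign ends the sequence.
        if self._last_commit is not None and now - self._last_commit > self.TIMEOUT:
            done, self.sequence = self.sequence, []
            self._last_commit = None
            self._candidate = None
            return done
        if sign != self._candidate:
            # New candidate: start its hold timer.
            self._candidate, self._candidate_since = sign, now
        elif sign is not None and now - self._candidate_since >= self.HOLD_TIME:
            # Held long enough; commit unless it duplicates the last sign.
            if not self.sequence or self.sequence[-1] != sign:
                self.sequence.append(sign)
                self._last_commit = now
        return None
```

Driving it with fake timestamps shows the flow: two held signs commit, then two seconds of silence flushes the sequence downstream.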
Gloss Translation
Glossa-BART (Seq2Seq Transformer)
ASL gloss is not English: word order differs and function words are dropped. A fine-tuned BART model with beam search translates gloss into natural English sentences.
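Beam search keeps the k best partial hypotheses at each decoding step instead of greedily committing to the single best token. A toy sketch of the algorithm with a hand-written next-token scorer (the real pipeline uses BART's learned probabilities, not this table):

```python
import math

def beam_search(score_next, start, steps, beam_width=3):
    """Generic beam search: keep the beam_width best hypotheses per step.

    score_next(tokens) -> list of (next_token, log_prob) candidates.
    """
    beams = [(0.0, [start])]  # (cumulative log-prob, token sequence)
    for _ in range(steps):
        candidates = []
        for logp, tokens in beams:
            for tok, tok_logp in score_next(tokens):
                candidates.append((logp + tok_logp, tokens + [tok]))
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:beam_width]
    return beams[0][1]

# Toy bigram "language model" standing in for BART's decoder.
toy_lm = {
    "<s>": [("what", math.log(0.6)), ("your", math.log(0.4))],
    "what": [("is", math.log(0.9)), ("name", math.log(0.1))],
    "is": [("your", math.log(0.8)), ("the", math.log(0.2))],
    "your": [("name", math.log(0.7)), ("what", math.log(0.3))],
    "name": [("</s>", math.log(1.0))],
    "the": [("name", math.log(1.0))],
    "</s>": [("</s>", math.log(1.0))],
}
best = beam_search(lambda toks: toy_lm[toks[-1]], "<s>", steps=5)
```

Greedy decoding can lock onto a locally likely token and miss the globally best sentence; the beam defers that commitment by a few steps.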
Intent Classification
Sentence-Transformers (all-MiniLM-L6-v2)
384-dimensional embeddings + cosine similarity classify questions into 5 intent categories: identity, time, date, location, general.
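Nearest-prototype classification by cosine similarity can be sketched with a stand-in embedder. The bag-of-words `embed` below and the prototype sentences are placeholders; the real pipeline uses all-MiniLM-L6-v2's 384-dimensional sentence embeddings:

```python
import math
from collections import Counter

# Illustrative prototype sentences, one per intent category.
INTENT_PROTOTYPES = {
    "identity": "who are you what is your name",
    "time": "what time is it",
    "date": "what day is today what is the date",
    "location": "where are you where is this place",
    "general": "tell me something",
}

def embed(text):
    """Stand-in embedding: bag-of-words counts instead of MiniLM vectors."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify_intent(sentence):
    """Pick the intent whose prototype is most similar to the sentence."""
    query = embed(sentence)
    return max(INTENT_PROTOTYPES,
               key=lambda k: cosine(query, embed(INTENT_PROTOTYPES[k])))
```

Swapping `embed` for a real sentence encoder leaves the classifier unchanged, which is the appeal of the prototype approach: adding an intent is just adding a row to the table.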
Answer Generation
Ollama + Llama 3.2 3B
Local LLM generates contextual answers with intent-aware prompts. Capped at 2 sentences for natural spoken output. No cloud API needed.
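Ollama exposes a local REST endpoint, `POST /api/generate`, that takes a JSON body with the model name and prompt. A sketch of building that request with an intent-aware prompt (the prompt wording here is illustrative, not the project's actual prompts):

```python
import json

# Intent-aware prompt prefixes (illustrative wording).
INTENT_PROMPTS = {
    "identity": "You are Reachy Mini, a friendly desktop robot.",
    "time": "Answer with the current time in one short sentence.",
    "general": "Answer helpfully and concisely.",
}

def build_ollama_request(question, intent):
    """Build the JSON body for Ollama's POST /api/generate endpoint."""
    prefix = INTENT_PROMPTS.get(intent, INTENT_PROMPTS["general"])
    prompt = f"{prefix}\nQuestion: {question}\nAnswer in at most 2 sentences."
    return json.dumps({
        "model": "llama3.2:3b",
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": 80},  # keep spoken answers short
    })
```

POSTing this body to `http://localhost:11434/api/generate` returns the answer with no cloud round-trip.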
Robot Response
Reachy Mini SDK + MuJoCo + TTS
Reachy Mini nods and wiggles its antennas, MuJoCo renders the simulation, and TTS speaks the answer aloud. All three response channels fire simultaneously.
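Firing three channels at once is a plain fan-out-and-join. A minimal sketch with threads, where the lambdas stand in for the actual SDK, simulation, and TTS calls (none of the channel names below are the Reachy Mini SDK's real API):

```python
import threading

def respond(answer, channels):
    """Run every response channel concurrently and wait for all to finish.

    `channels` is a list of callables taking the answer text: e.g. a robot
    motion routine, a simulation update, and TTS playback (stand-ins here).
    """
    threads = [threading.Thread(target=ch, args=(answer,)) for ch in channels]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Dummy channels that just record what they received.
log = []
respond("I am Reachy Mini.", [
    lambda a: log.append(("motion", a)),
    lambda a: log.append(("sim", a)),
    lambda a: log.append(("tts", a)),
])
```

Joining on all three keeps the pipeline from accepting the next question while the robot is still mid-nod or mid-sentence.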