Quick Start
This guide walks you through getting Cognitive Companion running on your local network.
Prerequisites
| Component | Purpose | Notes |
|---|---|---|
| NVIDIA GPU (32 GB minimum, 48 GB recommended) | Hosts the general LLM, the vision-language model, and the Triton vision models | RTX 5090 (32 GB) at minimum; A6000 or L40S (48 GB) for headroom; or split across GPUs. See GPU memory budget. |
| Docker + NVIDIA Container Toolkit | Container runtime | For all services |
| Home Assistant | Sensor integration, audio playback, actions | REST API + long-lived token |
| MinIO (or S3-compatible) | Media object storage | Pre-signed URL support required |
| vLLM | Serves the vision-language model | Cosmos-Reason2-8B at FP8, OpenAI-compatible API |
| llama.cpp llama-server | Serves the general reasoning model | Gemma 4 26B-A4B (MoE) at FP4, OpenAI-compatible API |
| Triton Inference Server | Vision, face, and embedding models | Detection, ReID, pose, face, CLIP, Florence-2, embeddinggemma-300m |
| Python 3.14 | Backend runtime | |
| uv | Python package manager | For local development |
| Node.js 24.16.x | Frontend build | For admin console, WebSocket audio interface |
Optional Components
| Component | Purpose |
|---|---|
| Telegram Bot | Caregiver alert notifications |
| Google Gemini API | Real-time voice conversations |
| TTS service | Text-to-speech announcements |
GPU memory budget
The system runs several models on the GPU at once. The two language models dominate the budget; the perception models are individually small but add up. The table below breaks the requirement down by component, at the precision each model is served at today.
| Component | What runs | Precision | Approx. VRAM |
|---|---|---|---|
| General reasoning LLM | Gemma 4 26B-A4B (MoE, all experts resident) | FP4 (llama.cpp) | ~13 GB |
| Vision-language model | Cosmos-Reason2-8B (vLLM) | FP8 | ~8 GB |
| Knowledge embeddings | embeddinggemma-300m (Triton) | FP32 ONNX | ~1.2 GB |
| Scene analysis | CLIP ViT-L/14 + Florence-2-large (Triton) | ONNX / INT8 | ~2 GB |
| Multi-camera tracking | YOLO26L + Swin ReID + RTMPose (Triton) | FP32 ONNX | ~0.3 GB |
| Face recognition | Buffalo_L: SCRFD + ArcFace R50 + landmarks (Triton) | FP32 ONNX | ~0.35 GB |
| Weights subtotal | | | ~25 GB |
A few things this table does not include, which you still need to budget for:
- vLLM KV cache. Cosmos-Reason2 is served with --max-model-len=16384 and --max-num-seqs=4, and the deployment uses --quantization=fp8 with --kv-cache-dtype=fp8 at --gpu-memory-utilization=0.25. KV cache grows with context length and the number of concurrent sequences.
- Per-process CUDA context of roughly 0.5 to 1 GB for each model server and for Triton.
- The general LLM precision is the biggest variable. The ~13 GB figure assumes FP4 (4-bit) quantization on llama.cpp, which is the current setup. FP8 roughly doubles it and BF16 roughly quadruples it. Because it is a mixture-of-experts model, all experts stay resident in VRAM even though only about 4B parameters activate per token, so memory tracks the 26B total, not the 4B active. The qwen-3.6-35b alternative in config/settings.yaml is larger, roughly 18 GB at FP4.
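The items above can be turned into a rough budget check. The sketch below is illustrative arithmetic only: the bytes-per-parameter values are the nominal ones for each format (real quantized files carry some extra metadata), the non-LLM weight figures are copied from the table, and `kv_cache_gb` is an assumed placeholder that you would replace with a measured value.

```python
# Rough VRAM budget arithmetic for the stack described above.
# Figures are approximations from the table; kv_cache_gb is an assumed
# placeholder, not a measured value.

# Nominal bytes per parameter for each serving precision.
BYTES_PER_PARAM = {"fp4": 0.5, "fp8": 1.0, "bf16": 2.0}

# Approximate weight footprints (GB) of everything except the general LLM.
OTHER_WEIGHTS_GB = {
    "vision_language_model": 8.0,
    "knowledge_embeddings": 1.2,
    "scene_analysis": 2.0,
    "multi_camera_tracking": 0.3,
    "face_recognition": 0.35,
}


def llm_weights_gb(total_params_billion: float, precision: str) -> float:
    """Weights-only footprint. For an MoE model pass total, not active, parameters."""
    return total_params_billion * BYTES_PER_PARAM[precision]


def total_budget_gb(llm_gb: float, kv_cache_gb: float,
                    n_processes: int, cuda_context_gb: float) -> float:
    """Weights plus KV cache plus one CUDA context per server process."""
    other = sum(OTHER_WEIGHTS_GB.values())
    return llm_gb + other + kv_cache_gb + n_processes * cuda_context_gb


if __name__ == "__main__":
    for precision in ("fp4", "fp8", "bf16"):
        llm = llm_weights_gb(26, precision)
        # Three processes (llama.cpp, vLLM, Triton) at the 1 GB upper bound each.
        total = total_budget_gb(llm, kv_cache_gb=2.0, n_processes=3, cuda_context_gb=1.0)
        print(f"{precision}: LLM ~{llm:.0f} GB, estimated total ~{total:.1f} GB")
```

Comparing the printed totals against your card's capacity shows quickly why the general LLM precision sets the floor.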
Two practical consequences:
- A 32 GB GPU can host the full stack, with a 48 GB card giving comfortable headroom for KV cache and concurrency. You can also split the models across GPUs. The perception models have INT8 and Jetson-quantized variants (continuous-tracking/triton-models-jetson/) that bring them to under 1 GB combined, but the LLM and the vision-language model are still what set the floor.
- The model servers talk over OpenAI-compatible URLs and Triton gRPC, so the LLM and VLM can run on a separate host or GPU from the perception stack. Point VISION_MODEL_URL, GEMMA_MODEL_URL, and EMBEDDING_TRITON_URL at wherever they run.
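To see at a glance how a deployment is split, you can group the three endpoint variables by the host they point at. This is a minimal sketch: the hostnames in the usage note are hypothetical, and the exact format of EMBEDDING_TRITON_URL (full URL or bare host:port) is an assumption, so the helper accepts both.

```python
import os
from collections import defaultdict
from urllib.parse import urlsplit

# Endpoint variables named in the text above.
ENDPOINT_VARS = ("VISION_MODEL_URL", "GEMMA_MODEL_URL", "EMBEDDING_TRITON_URL")


def host_of(value: str) -> str:
    """Return the hostname of a URL or of a bare host:port string."""
    # A bare "host:port" has no scheme; add a dummy one so urlsplit finds the host.
    if "://" not in value:
        value = "dummy://" + value
    return urlsplit(value).hostname or ""


def group_by_host(env) -> dict:
    """Map each hostname to the endpoint variables that point at it."""
    groups = defaultdict(list)
    for name in ENDPOINT_VARS:
        if env.get(name):
            groups[host_of(env[name])].append(name)
    return dict(groups)


if __name__ == "__main__":
    print(group_by_host(os.environ))
```

For example, with the LLM and VLM on a hypothetical `gpu-host` and Triton on a hypothetical `triton-host`, the result has two keys, one per machine.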
Reference hardware: offload perception to a Jetson
A reference split runs the two language models on a main GPU and moves the latency-sensitive vision and face models to a Jetson Orin Nano Super (8 GB unified memory) acting as an inference appliance:
- Main GPU host: general LLM, vision-language model, knowledge embeddings, and scene analysis (CLIP + Florence-2).
- Jetson Orin Nano Super: detector (YOLO26L), pose (RTMPose-m), body ReID (SOLIDER), and face detection and recognition (SCRFD + ArcFace from buffalo_l), all served as selective INT8 TensorRT plans.
These perception models are small in VRAM, about 0.65 GB on the main GPU, so the point of the split is not to reclaim much memory. It is to keep per-frame inference and face embeddings on a low-power box on the local network, off the GPU that serves the LLM. Qualify six cameras first; eight is conditional on the detector p95 staying under 140 ms with stable memory headroom.
See Run CTS inference on Jetson Orin Nano Super for the model-by-model quantization recipe, qualification gates, and production metrics.
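The eight-camera condition can be expressed as a small gate over measured detector latencies. The sketch below is an illustration, not the project's qualification tooling: it assumes six cameras have already been qualified, uses a nearest-rank percentile, and takes the memory-headroom check as a boolean you supply from your own monitoring.

```python
P95_LIMIT_MS = 140  # detector p95 gate for eight cameras, from the text above


def p95(samples_ms):
    """Nearest-rank 95th percentile of a list of latency samples (ms)."""
    if not samples_ms:
        raise ValueError("no latency samples")
    ordered = sorted(samples_ms)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def max_qualified_cameras(samples_at_eight_ms, memory_headroom_stable):
    """Return 8 only when both gates pass; otherwise stay at the six-camera baseline."""
    if p95(samples_at_eight_ms) < P95_LIMIT_MS and memory_headroom_stable:
        return 8
    return 6
```

Collect `samples_at_eight_ms` while all eight streams are running, since latency measured under a lighter load says little about the gate.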
Step 1: Configure Environment
git clone https://github.com/SilverMind-Project/cognitive-companion.git
cd cognitive-companion
cp .env.example .env
Edit .env with your service URLs and API keys:
# LLM Providers
VISION_MODEL_URL=http://localhost:8001 # vLLM (Cosmos-Reason2-8B, FP8)
GEMMA_MODEL_URL=http://localhost:8100 # llama.cpp (Gemma 4 26B-A4B, FP4)
# Home Assistant
HOME_ASSISTANT_URL=http://homeassistant.local:8123
HOME_ASSISTANT_TOKEN=your_long_lived_access_token
# Object Storage
MINIO_ENDPOINT=localhost:9000
MINIO_ACCESS_KEY=minioadmin
MINIO_SECRET_KEY=minioadmin
# Person Identification
PERSON_ID_SERVICE_URL=http://localhost:8200
# Authentication
CC_ADMIN_API_KEY=your_admin_key
CC_CAREGIVER_API_KEY=your_caregiver_key
CC_MCP_API_KEY=your_mcp_key
Review config/settings.yaml for application behavior: event aggregation windows, LLM model names, polling intervals, and more. See Configuration for a full reference.
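Before starting the services, it can help to confirm that .env no longer contains the template placeholders. The following is a minimal sketch rather than part of the project: it assumes every key shown in the example above is required, treats any value beginning with `your_` as an unfilled placeholder, and strips trailing ` # ...` comments, which may differ from how the backend's own loader reads the file.

```python
from pathlib import Path

# Keys shown in the example .env above; treating all of them as required
# is an assumption for this sketch.
REQUIRED_KEYS = (
    "VISION_MODEL_URL", "GEMMA_MODEL_URL",
    "HOME_ASSISTANT_URL", "HOME_ASSISTANT_TOKEN",
    "MINIO_ENDPOINT", "MINIO_ACCESS_KEY", "MINIO_SECRET_KEY",
    "PERSON_ID_SERVICE_URL",
    "CC_ADMIN_API_KEY", "CC_CAREGIVER_API_KEY", "CC_MCP_API_KEY",
)


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and full-line comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "  # vLLM (...)".
        values[key.strip()] = value.split(" #", 1)[0].strip()
    return values


def find_problems(values: dict) -> list:
    """List required keys that are missing, empty, or still placeholders."""
    problems = []
    for key in REQUIRED_KEYS:
        value = values.get(key, "")
        if not value:
            problems.append(f"{key} is missing or empty")
        elif value.startswith("your_"):
            problems.append(f"{key} still has the placeholder value")
    return problems


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        for problem in find_problems(parse_env(env_path.read_text())):
            print(problem)
```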
Step 2: Start All Services
Option A: Docker Compose
# 1. Start the shared PostgreSQL instance first (hosts all 3 project databases)
docker compose -f docker-compose.db.yml -p nanai up -d
# 2. Start each subproject (pulls in shared DB via include)
cd cognitive-companion && docker compose up -d
cd ../continuous-tracking && docker compose --profile app up -d
cd ../semantic-memory-service && docker compose up -d
# 3. Initialize the cognitive-companion database
cd ../cognitive-companion && make init-db
# 4. Verify
curl http://localhost:8000/api/v1/health # Backend
curl http://localhost:8400/health # Semantic Memory
Docker Compose handles inter-service networking automatically. The shared timescale/timescaledb-ha:pg18 container hosts cognitive_companion, continuous_tracking, and semantic_memory databases. Each service connects with its own database user.
TIP
The person-ID service requires GPU access. Ensure the NVIDIA Container Toolkit is installed.
See Deployment for the full Docker Compose and Kubernetes reference.
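If you prefer a scripted check over the two curl commands in the verify step, the same requests can be made from Python. This is a sketch under assumptions: it treats an HTTP 200 from each endpoint as healthy (the response bodies are not documented here and are ignored), and the `fetch` parameter exists only so the logic can be exercised without live services.

```python
import urllib.error
import urllib.request

# Health endpoints from the verify step above.
HEALTH_URLS = {
    "backend": "http://localhost:8000/api/v1/health",
    "semantic-memory": "http://localhost:8400/health",
}


def http_status(url, timeout=3.0):
    """Return the HTTP status code, or None if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def check_all(urls, fetch=http_status):
    """Map each service name to True when its endpoint returns HTTP 200."""
    return {name: fetch(url) == 200 for name, url in urls.items()}


if __name__ == "__main__":
    for name, healthy in check_all(HEALTH_URLS).items():
        print(f"{name}: {'ok' if healthy else 'not ready'}")
```

Containers can take a few seconds to come up, so a "not ready" result right after `docker compose up -d` is worth retrying before you investigate.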
Option B: Run Services Individually
Start the Person Identification Service (GPU-accelerated face recognition):
cd ../person-identification-service
docker build -t person-id-service .
docker run --gpus all -p 8200:8200 -v ./data:/app/data person-id-service
See the Person Identification Service README for enrollment instructions and API documentation.
Start the Backend:
# With Docker
docker build -t cognitive-companion .
docker run -p 8000:8000 \
-v ./data:/app/data \
-v ./config:/app/config \
--env-file .env \
cognitive-companion
# Or for local development (requires uv: https://docs.astral.sh/uv/)
cd backend && uv sync --extra gemini && cd ..
uv run --directory backend uvicorn backend.main:app --host 0.0.0.0 --port 8000 --reload
The gemini extra installs the google-genai package for voice companion support. Omit it if you don't need real-time voice.
Start the Frontend:
cd frontend
npm install
npm run dev # Development server at http://localhost:5173
For production, the frontend is containerized with nginx:
cd frontend
docker build -t cognitive-companion-ui .
docker run -p 80:80 cognitive-companion-ui
Step 3: Initial Setup
- Open the admin console at http://localhost:5173/admin
- Set your admin API key in the settings
- Create rooms. Define the physical spaces in your home (kitchen, bedroom, etc.)
- Register sensors. Add cameras and presence sensors, assigning each to a room
- Enroll household members. Go to Members & Enrollment, register each person, then click the face-recognition icon to upload 5-10 reference photos per person
- Create rules. Use the visual pipeline builder to assemble step graphs
Your First Rule
A basic camera monitoring rule might look like:
person_identification → llm_call (vision) → llm_call (reasoning) → notification
- Go to Rules → New Rule, enter a name, and click Create; you'll land on the rule detail page
- On the Settings tab, set the trigger type to sensor_event and bind it to a camera sensor
- Switch to the Pipeline tab, add nodes from the palette, and connect them with edges:
  - Person Identification: identify who is in the frame
  - LLM Call (vision, e.g. Cosmos Reason2): describe what is happening
  - LLM Call (reasoning, e.g. Gemma 4): decide if a notification is warranted
  - Notification: send the alert to configured channels
- Configure each step's settings in its config dialog
- Enable the rule and save
The rule will now execute whenever the bound camera sends an event. You can monitor live and historical runs in the Executions view and inspect pipeline data in the Events log.
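Conceptually, the rule above is a small directed graph whose steps run in dependency order, each one reading what earlier steps produced. The sketch below illustrates that idea with the standard-library `graphlib` module. It is not the project's rule schema or executor: the step ids, the `handlers` mapping, and the shared `context` dict are all hypothetical names chosen for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical step ids; each step maps to the set of steps it depends on.
PIPELINE = {
    "person_identification": set(),
    "llm_call_vision": {"person_identification"},
    "llm_call_reasoning": {"llm_call_vision"},
    "notification": {"llm_call_reasoning"},
}


def execution_order(graph):
    """Return the steps in an order that respects every edge."""
    return list(TopologicalSorter(graph).static_order())


def run(graph, handlers, event):
    """Run each step in order, storing its result in a shared context."""
    context = {"event": event}
    for step in execution_order(graph):
        context[step] = handlers[step](context)
    return context


if __name__ == "__main__":
    # Stub handlers that only record that they ran.
    stubs = {step: (lambda ctx, s=step: f"{s} done") for step in PIPELINE}
    result = run(PIPELINE, stubs, event={"sensor": "kitchen-camera"})
    print(execution_order(PIPELINE))
    print(result["notification"])
```

Because this pipeline is a straight chain, only one valid order exists. A graph with branches would have several, and `TopologicalSorter` raises `CycleError` if the edges form a loop.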