
# Project Plan: aboutme_chat_demo

## Goal
Build a comprehensive AI agent system that ingests data from self-hosted services (Gitea, notes, wikis), stores it in a vector database, and provides intelligent responses through a multi-agent orchestration layer. The system emphasizes modular containerized architecture, industry-standard tools, and employment-relevant skills.

---

## Phase 1: Foundation & Core Infrastructure (COMPLETED)

### Phase 1.1: Frontend Application
**Location:** `/home/sam/development/aboutme_chat_demo/frontend/`

**Stack & Tools:**
- **Framework:** Vite 6.2.0 + React 19.0.0 + TypeScript
- **Styling:** Tailwind CSS 4.0.0
- **State Management:** TanStack Query (React Query) 5.67.0
- **Build Tool:** Vite with React plugin
- **Linting:** ESLint 9.21.0 + typescript-eslint 8.24.0

**Components Implemented:**
- `ChatInterface.tsx` - Auto-expanding text input with scrolling message list
- `App.tsx` - Main application container
- Real-time chat UI with message history
- HTTP client integration to backend gateway

**Docker Configuration:**
- Hot-reload development setup
- Volume mounting for instant code changes
- Node modules isolation (`/app/node_modules`)

### Phase 1.2: Chat Gateway (Orchestration Entry Point)
**Location:** `/home/sam/development/aboutme_chat_demo/backend/`

**Stack & Tools:**
- **Framework:** FastAPI (Python 3.11)
- **HTTP Client:** httpx 0.28.1
- **CORS:** All origins allowed (development setting)

**Architecture Changes:**
- **OLD:** Hardcoded keyword matching (`["sam", "hobby", "music", "guitar", "skiing", "experience"]`) to trigger knowledge lookup
- **NEW:** Thin routing layer - all queries passed to LangGraph Supervisor for intelligent agent selection
- Removed direct Brain (LLM) integration
- Removed direct Knowledge Service calls
- Now acts as stateless entry point to LangGraph orchestration layer

**Endpoints:**
- `POST /chat` - Routes queries to LangGraph Supervisor
- `GET /health` - Service health check
- `GET /agents` - Lists available agents from LangGraph
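
A minimal sketch of the stateless hand-off behind `POST /chat`, using only the standard library. The service name and port come from this plan; the `/route` path, the payload keys, and the helper names `post_json` and `handle_chat` are hypothetical, and the real gateway is a FastAPI app that uses httpx.

```python
import json
from typing import Callable
from urllib import request

# Host and port are from the plan; the /route path is an assumption.
SUPERVISOR_URL = "http://langgraph-service:8090/route"


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON body and decode the JSON reply (stdlib stand-in for httpx)."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def handle_chat(message: str, post: Callable[[str, dict], dict] = post_json) -> dict:
    """Forward the query untouched; the gateway keeps no state and picks no agent."""
    reply = post(SUPERVISOR_URL, {"message": message})
    # Payload keys ("message", "response", "agent") are hypothetical.
    return {"response": reply.get("response", ""), "agent": reply.get("agent")}
```

Passing `post` as a parameter keeps the handler testable without a running supervisor.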

### Phase 1.3: Knowledge Service (Librarian Agent)
**Location:** `/home/sam/development/knowledge_service/`

**Stack & Tools:**
- **Framework:** FastAPI + Uvicorn
- **Vector Database:** ChromaDB 1.5.1
- **Embeddings:** OpenAI via OpenRouter API (text-embedding-3-small)
- **LLM Framework:** LangChain ecosystem
  - langchain 1.2.10
  - langchain-community 0.4.1
  - langchain-core 1.2.15
  - langchain-text-splitters 1.1.1
  - langchain-openai
- **Document Processing:** RecursiveCharacterTextSplitter

**Key Files:**
- `main.py` - FastAPI endpoints for /query and /health
- `gitea_scraper.py` - Gitea API integration module (NEW)
- `data/hobbies.md` - Sample knowledge base content
- `chroma_db/` - Persistent vector storage

**Docker Architecture (Optimized):**
- **Pattern:** Separate `/app/packages` (cached) from `/app/code` (volume-mounted)
- **Benefits:**
  - Code changes apply instantly without rebuild
  - Package installation happens once during image build
  - `PYTHONPATH=/app/packages` ensures imports work
- **Volumes:**
  - `./data:/app/code/data` - Knowledge documents
  - `./chroma_db:/app/code/chroma_db` - Vector database persistence
  - `./main.py:/app/code/main.py:ro` - Read-only code mount

### Phase 1.4: LangGraph Supervisor Service (NEW)
**Location:** `/home/sam/development/langgraph_service/`

**Stack & Tools:**
- **Framework:** FastAPI + Uvicorn
- **Orchestration:** LangGraph 1.0.9
  - langgraph-checkpoint 4.0.0
  - langgraph-prebuilt 1.0.8
  - langgraph-sdk 0.3.9
- **State Management:** TypedDict with Annotated operators
- **Message Types:** LangChain Core Messages (HumanMessage, AIMessage)
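
One way to read "TypedDict with Annotated operators": a state key can carry a reducer in its annotation, and an update returned by a node is merged through that reducer instead of overwriting the key. The sketch below imitates that merge with the standard library only. `AgentState`, its fields, and `merge_state` are hypothetical names, plain strings stand in for `HumanMessage`/`AIMessage`, and LangGraph performs the equivalent merge internally.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class AgentState(TypedDict):
    # operator.add is the reducer: new messages are appended, not replaced.
    messages: Annotated[list[str], operator.add]
    # No reducer: the latest value wins.
    next_agent: str


def merge_state(current: dict, update: dict) -> dict:
    """Apply a node's partial update, honouring any reducer in the annotation."""
    hints = get_type_hints(AgentState, include_extras=True)
    merged = dict(current)
    for key, value in update.items():
        metadata = getattr(hints[key], "__metadata__", ())
        if metadata and key in current:
            merged[key] = metadata[0](current[key], value)
        else:
            merged[key] = value
    return merged
```

For example, merging `{"messages": ["hello"], "next_agent": "librarian"}` into a state whose `messages` is `["hi"]` keeps both messages and replaces `next_agent`.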

**Architecture:**
- **Supervisor Node:** Analyzes queries and routes to specialist agents
- **Agent Graph:** StateGraph with conditional edges
- **Three Specialist Agents:**
  1. **Librarian Agent** - Queries ChromaDB via knowledge-service:8080
  2. **Opencode Agent** - Placeholder for coding tasks (MCP integration ready)
  3. **Brain Agent** - Fallback to OpenCode Brain LLM (opencode-brain:5000)

**Routing Logic:**
```
Query → Supervisor → [Librarian | Opencode | Brain]
- "repo/code/git/project" → Librarian (RAG)
- "write/edit/create/fix" → Opencode (Coding)
- "sam/hobby/music/about" → Librarian (RAG)
- Default → Brain (General LLM)
```
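
The table above can be sketched as an ordered keyword router. This is an illustration, not the supervisor's actual code: whole-word matching and first-match-wins ordering are assumptions, and the names `ROUTING_RULES` and `route` are hypothetical.

```python
import re

# Rules in the order listed in the plan; the first matching rule wins (assumption).
ROUTING_RULES = [
    ({"repo", "code", "git", "project"}, "librarian"),
    ({"write", "edit", "create", "fix"}, "opencode"),
    ({"sam", "hobby", "music", "about"}, "librarian"),
]
DEFAULT_AGENT = "brain"


def route(query: str) -> str:
    """Return the agent name for a query based on whole-word keyword matches."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    for keywords, agent in ROUTING_RULES:
        if words & keywords:
            return agent
    return DEFAULT_AGENT
```

Two consequences of these assumptions are worth noting: "Write code for a parser" goes to the Librarian because "code" matches the first rule, and "hobbies" does not match "hobby" under whole-word matching. Substring matching or stemming would change both outcomes.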

**Docker Configuration:**
- Self-contained with own `/app/packages`
- No package sharing with other services (modular)
- Port 8090 exposed

### Phase 1.5: Apache Airflow (Scheduled Ingestion)
**Location:** `/home/sam/development/airflow/`

**Stack & Tools:**
- **Orchestration:** Apache Airflow 2.8.1
- **Executor:** CeleryExecutor (distributed task processing)
- **Database:** PostgreSQL 13 (metadata)
- **Message Queue:** Redis (Celery broker)
- **Services:**
  - airflow-webserver (UI + API)
  - airflow-scheduler (DAG scheduling)
  - airflow-worker (task execution)
  - airflow-triggerer (deferrable operators)

**DAG: gitea_daily_ingestion**
- **Schedule:** Daily
- **Tasks:**
  1. `fetch_repos` - Get all user repos from Gitea API
  2. `fetch_readmes` - Download README files
  3. `ingest_to_chroma` - Store in Knowledge Service
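
The task order and data hand-off of this DAG can be sketched with plain callables. In Airflow each function would be its own task and the return values would travel between tasks; here they are chained directly so the flow is easy to follow. The callback parameters (`list_repos`, `get_readme`, `ingest`) and `run_pipeline` are hypothetical stand-ins for the Gitea scraper and the Knowledge Service client.

```python
from typing import Callable, Optional


def fetch_repos(list_repos: Callable[[], list[str]]) -> list[str]:
    """Task 1: return repository names in a stable order."""
    return sorted(list_repos())


def fetch_readmes(
    repos: list[str], get_readme: Callable[[str], Optional[str]]
) -> dict[str, str]:
    """Task 2: download READMEs, skipping repositories that have none."""
    readmes = {}
    for name in repos:
        text = get_readme(name)
        if text:
            readmes[name] = text
    return readmes


def ingest_to_chroma(readmes: dict[str, str], ingest: Callable[[str, str], None]) -> int:
    """Task 3: hand each README to the Knowledge Service; return the count."""
    for name, text in readmes.items():
        ingest(name, text)
    return len(readmes)


def run_pipeline(list_repos, get_readme, ingest) -> int:
    """Run the three tasks in DAG order: fetch_repos, fetch_readmes, ingest_to_chroma."""
    repos = fetch_repos(list_repos)
    readmes = fetch_readmes(repos, get_readme)
    return ingest_to_chroma(readmes, ingest)
```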

**Integration:**
- Mounts `knowledge_service/gitea_scraper.py` into DAGs folder
- Environment variables for Gitea API token
- Network: ai-mesh (communicates with knowledge-service)

### Phase 1.6: Gitea Scraper Module
**Location:** `/home/sam/development/knowledge_service/gitea_scraper.py`

**Functionality:**
- **API Integration:** Gitea REST API v1
- **Authentication:** Token-based (Authorization header)
- **Methods:**
  - `get_user_repos()` - Paginated repo listing
  - `get_readme(repo_name)` - README content with fallback names
  - `get_repo_files(repo_name, path)` - Directory listing
  - `get_file_content(repo_name, filepath)` - File download
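
A minimal sketch of the pagination and README-fallback behaviour, assuming the Gitea endpoints `GET /api/v1/user/repos` (with `page` and `limit`) and `GET /api/v1/repos/{owner}/{repo}/raw/{filepath}`. The class name `GiteaClient`, the `owner` and `fetch` constructor arguments, and the list of fallback filenames are hypothetical; the real module may differ.

```python
import json
from typing import Callable, Optional
from urllib import error, parse, request

# Assumed fallback order; the actual module's list is not shown in this plan.
README_CANDIDATES = ("README.md", "readme.md", "README.rst", "README.txt", "README")


class GiteaClient:
    def __init__(self, base_url: str, token: str, owner: str,
                 fetch: Optional[Callable[[str], Optional[bytes]]] = None):
        self.api = base_url.rstrip("/") + "/api/v1"
        self.token = token
        self.owner = owner
        self.fetch = fetch or self._http_get  # injectable for tests

    def _http_get(self, url: str) -> Optional[bytes]:
        """GET with token auth; a 404 becomes None so callers can fall back."""
        req = request.Request(url, headers={"Authorization": f"token {self.token}"})
        try:
            with request.urlopen(req, timeout=30) as resp:
                return resp.read()
        except error.HTTPError as exc:
            if exc.code == 404:
                return None
            raise

    def get_user_repos(self, page_size: int = 50) -> list[dict]:
        """Request pages until the API returns an empty one."""
        repos, page = [], 1
        while True:
            query = parse.urlencode({"page": page, "limit": page_size})
            body = self.fetch(f"{self.api}/user/repos?{query}")
            batch = json.loads(body) if body else []
            if not batch:
                return repos
            repos.extend(batch)
            page += 1

    def get_readme(self, repo_name: str) -> Optional[str]:
        """Try each candidate filename and return the first that exists."""
        for candidate in README_CANDIDATES:
            path = parse.quote(candidate)
            body = self.fetch(f"{self.api}/repos/{self.owner}/{repo_name}/raw/{path}")
            if body is not None:
                return body.decode("utf-8", errors="replace")
        return None
```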

**Data Model:**
- `RepoMetadata` dataclass (name, description, url, branch, updated_at, language)
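
A possible shape for this dataclass, as a sketch. The six field names come from the list above; the field types, the defaults, and the `from_api` constructor are assumptions, as is the mapping from the Gitea repository JSON keys `html_url` and `default_branch`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RepoMetadata:
    name: str
    description: str
    url: str
    branch: str
    updated_at: str  # kept as the API's timestamp string (assumption)
    language: Optional[str] = None

    @classmethod
    def from_api(cls, payload: dict) -> "RepoMetadata":
        """Build metadata from one repository object in the Gitea API response."""
        return cls(
            name=payload["name"],
            description=payload.get("description") or "",
            url=payload.get("html_url", ""),
            branch=payload.get("default_branch", "main"),
            updated_at=payload.get("updated_at", ""),
            language=payload.get("language") or None,
        )
```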

### Phase 1.7: Docker Infrastructure

**Network:**
- `ai-mesh` (external) - Shared bridge network for all services

**Services Overview:**
| Service | Port | Purpose | Dependencies |
|---------|------|---------|--------------|
| frontend | 5173 | React UI | backend |
| backend | 8000 | Chat Gateway | langgraph-service, db |
| db | 5432 | PostgreSQL (chat history) | - |
| knowledge-service | 8080 | RAG / Vector DB | - |
| langgraph-service | 8090 | Agent Orchestration | knowledge-service |
| airflow-webserver | 8081 | Workflow UI | postgres, redis |
| airflow-scheduler | - | DAG scheduling | postgres, redis |
| airflow-worker | - | Task execution | postgres, redis |
| redis | 6379 | Message broker | - |
| postgres (airflow) | - | Airflow metadata | - |

**Container Patterns:**
- All Python services use `/app/packages` + `/app/code` separation
- Node.js services use volume mounting for hot reload
- PostgreSQL uses named volumes for persistence
- External network (`ai-mesh`) for cross-service communication

---

## Phase 2: Multi-Source Knowledge Ingestion (IN PROGRESS)

### Goal
Expand beyond Gitea to ingest data from all self-hosted knowledge sources.

### Data Sources to Integrate:
1. **Notes & Documentation**
   - **TriliumNext** - Hierarchical note-taking (tree structure)
   - **Obsidian** - Markdown vault with backlinks
   - **Flatnotes** - Flat file markdown notes
   - **HedgeDoc** - Collaborative markdown editor

2. **Wiki**
   - **DokuWiki** - Structured wiki content

3. **Project Management**
   - **Vikunja** - Task lists and project tracking

4. **Media & Assets**
   - **Immich** - Photo/video metadata + Gemini Vision API for content description
   - **HomeBox** - Physical inventory with images

### Technical Approach:
- **Crawling:** Selenium/Playwright for JavaScript-heavy UIs
- **Extraction:** Firecrawl or LangChain loaders for structured content
- **Vision:** Gemini Vision API for image-to-text conversion
- **Storage:** ChromaDB (vectors) + PostgreSQL (metadata, hashes for deduplication)
- **Scheduling:** Additional Airflow DAGs per source
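
The hash-based deduplication mentioned under Storage could look like the sketch below: hash each document's normalized text and skip re-embedding when the stored hash is unchanged. `content_hash` and `HashStore` are hypothetical names, the normalization rule is an assumption, and the in-memory dict stands in for a PostgreSQL table.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 of the text after trimming trailing whitespace on each line."""
    normalized = "\n".join(line.rstrip() for line in text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class HashStore:
    """In-memory stand-in for a table of (doc_id, content_hash) rows."""

    def __init__(self) -> None:
        self._hashes: dict[str, str] = {}

    def needs_ingest(self, doc_id: str, text: str) -> bool:
        """Return True and record the hash if the document is new or changed."""
        digest = content_hash(text)
        if self._hashes.get(doc_id) == digest:
            return False
        self._hashes[doc_id] = digest
        return True
```

With this check in front of the embedding call, a daily DAG run only pays for documents whose content actually changed.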

---

## Phase 3: Advanced Agent Capabilities

### Goal
Integrate external AI tools and expand agent capabilities.

### Agent Tooling:
1. **MCP (Model Context Protocol) Servers**
   - Git MCP - Local repository operations
   - Filesystem MCP - Secure file access
   - Memory MCP - Knowledge graph persistence
   - Custom Gitea MCP (if/when available)

2. **External Agents**
   - **Goose** - CLI-based agent for local task execution
   - **Aider** - AI pair programming
   - **Opencode** - Already integrated as the LLM behind the Brain Agent; the dedicated Opencode Agent is still a placeholder
   - **Automaker** - Workflow automation
   - **Autocoder** - Code generation

3. **Orchestration Tools**
   - **CAO CLI** - Agent orchestrator
   - **Agent Pipe** - Pipeline management

### Integration Pattern:
- Each external tool wrapped as LangGraph node
- Supervisor routes to appropriate specialist
- State management for multi-turn interactions

---

## Phase 4: Production Hardening

### Goal
Prepare system for production deployment.

### Authentication & Security:
- **Laravel** - User authentication service (carried over from the original Phase 4 plan)
- **JWT tokens** - Session management
- **API key management** - Secure credential storage
- **Network policies** - Inter-service communication restrictions

### Monitoring & Observability:
- **LangSmith** - LLM tracing and debugging
- **Langfuse** - LLM observability (note: currently in per-project install list)
- **Prometheus/Grafana** - Metrics and dashboards
- **Airflow monitoring** - DAG success/failure alerting

### Scaling:
- **ChromaDB** - Migration to server mode for concurrent access
- **Airflow** - Multiple Celery workers
- **Load balancing** - Nginx reverse proxy
- **Backup strategies** - Vector DB snapshots, PostgreSQL dumps

---

## Phase 5: Workflow Automation & Visual Tools

### Goal
Add visual prototyping and automation capabilities.

### Tools to Integrate:
1. **Flowise** - Visual LangChain builder
   - Prototype agent flows without coding
   - Export to Python code
   - Debug RAG pipelines visually

2. **Windmill** - Turn scripts into workflows
   - Schedule Python/LangChain scripts
   - Reactive triggers (e.g., on-commit)
   - Low-code workflow builder

3. **Activepieces** - Event-driven automation
   - Webhook triggers from Gitea
   - Integration with external APIs
   - Visual workflow designer

4. **n8n** - Alternative workflow automation
   - Consider if Activepieces doesn't meet needs

### Use Cases:
- **On-commit triggers:** Gitea push → immediate re-scan → notification
- **Scheduled reports:** Weekly summary of new/updated projects
- **Reactive workflows:** New photo uploaded → Gemini Vision → update knowledge base
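
For the on-commit trigger, the receiving service should verify that a push event really came from Gitea before re-scanning. The sketch below assumes Gitea's webhook signature scheme, a hex HMAC-SHA256 of the raw request body sent in the `X-Gitea-Signature` header, and reads `repository.full_name` from the push payload. The function names are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Optional


def verify_signature(secret: str, body: bytes, signature: str) -> bool:
    """Compare the received signature with our own HMAC of the raw body."""
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


def repo_to_rescan(secret: str, body: bytes, signature: str) -> Optional[str]:
    """Return the repository to re-scan, or None if the signature is invalid."""
    if not verify_signature(secret, body, signature):
        return None
    payload = json.loads(body)
    return payload.get("repository", {}).get("full_name")
```

The signature must be computed over the raw bytes as received, not over a re-serialized JSON object, because any change in whitespace or key order changes the digest.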

---

## Phase 6: Knowledge Library Options & RAG Enhancement

### Goal
Advanced retrieval and knowledge organization.

### RAG Pipeline Improvements:
1. **Hybrid Search**
   - Semantic search (ChromaDB) + Keyword search (PostgreSQL)
   - Re-ranking with cross-encoders
   - Query expansion and decomposition

2. **Multi-Modal RAG**
   - Image retrieval (Immich + CLIP embeddings)
   - Document parsing (PDFs, code files)
   - Structured data (tables, lists)

3. **Knowledge Organization**
   - Entity extraction and linking
   - Knowledge graph construction
   - Hierarchical chunking strategies
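
For the hybrid search item, one common way to combine a semantic result list with a keyword result list is reciprocal rank fusion (RRF). This is a swapped-in illustration rather than a technique named in this plan, and it does not replace cross-encoder re-ranking, which could run on the fused list afterwards. The function name and the constant `k = 60` are conventional choices, not project settings.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs; each hit adds 1 / (k + rank) to its score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by document ID for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic_hits = ["a", "b", "c"]  # e.g. from ChromaDB
keyword_hits = ["b", "d", "a"]   # e.g. from PostgreSQL full-text search
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

RRF uses only rank positions, so the two sources need no score normalization, which is convenient when one side returns cosine distances and the other returns text-search ranks.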

### Alternative Vector Stores (Evaluation):
- **pgvector** - PostgreSQL native (if ChromaDB limitations hit)
- **Weaviate** - GraphQL interface, hybrid search
- **Qdrant** - Rust-based, high performance
- **Milvus** - Enterprise-grade, distributed

---

## Phase 7: User Experience & Interface

### Goal
Enhanced frontend and interaction patterns.

### Frontend Enhancements:
1. **Chat Interface Improvements**
   - Streaming responses (Server-Sent Events)
   - Message threading and context
   - File upload for document ingestion
   - Image display (for Immich integration)

2. **Knowledge Browser**
   - View ingested documents
   - Search knowledge base directly
   - See confidence scores and sources
   - Manual document upload/ingestion trigger

3. **Agent Management**
   - View active agents
   - Configure agent behavior
   - Monitor agent performance
   - Override routing decisions
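
The streaming item under Chat Interface Improvements relies on the Server-Sent Events wire format: each event is one or more `data:` lines followed by a blank line. A minimal sketch of the server-side framing follows; the `token` payload key and the closing `done` event are hypothetical conventions, not part of the SSE format itself.

```python
import json
from typing import Iterable, Iterator, Optional


def sse_event(data: str, event: Optional[str] = None) -> str:
    """Frame one event; multi-line data becomes several data: lines."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    lines.extend(f"data: {part}" for part in data.split("\n"))
    return "\n".join(lines) + "\n\n"


def stream_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE frame per token, then a final frame so the client can stop."""
    for token in tokens:
        yield sse_event(json.dumps({"token": token}))
    yield sse_event("{}", event="done")
```

A generator like `stream_tokens` can be returned from a streaming HTTP response with the `text/event-stream` content type, and the browser can consume it with `EventSource` or a streaming `fetch`.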

### Mobile & Accessibility:
- Responsive design improvements
- Mobile app (React Native or PWA)
- Accessibility compliance (WCAG)

---

## Technology Stack Summary

### Core Frameworks:
- **Backend:** FastAPI (Python 3.11)
- **Frontend:** Vite + React 19 + TypeScript
- **Styling:** Tailwind CSS
- **Database:** PostgreSQL 15 (the Airflow metadata database in Phase 1.5 runs PostgreSQL 13)
- **Vector DB:** ChromaDB 1.5.1

### AI/ML Stack:
- **LLM Orchestration:** LangGraph 1.0.9 + LangChain
- **Embeddings:** OpenAI via OpenRouter (text-embedding-3-small)
- **LLM:** OpenCode Brain (opencode-brain:5000)
- **Vision:** Gemini Vision API (Phase 2)

### Workflow & Scheduling:
- **Orchestration:** Apache Airflow 2.8.1 (CeleryExecutor)
- **Message Queue:** Redis
- **External Tools:** Flowise, Windmill, Activepieces

### Development Tools:
- **Containers:** Docker + Docker Compose
- **Networking:** Bridge network (ai-mesh)
- **Testing:** Manual API checks with curl/httpx
- **Version Control:** Gitea (self-hosted)

### Skills Demonstrated:
- Containerized microservices architecture
- Multi-agent AI orchestration (LangGraph)
- Vector database implementation (RAG)
- ETL pipeline development (Airflow)
- API integration and web scraping
- Modular, maintainable code organization
- Industry-standard AI tooling (LangChain ecosystem)
- Workflow automation and scheduling