Initial commit: Multi-service AI agent system

- Frontend: Vite + React + TypeScript chat interface
- Backend: FastAPI gateway with LangGraph routing
- Knowledge Service: ChromaDB RAG with Gitea scraper
- LangGraph Service: Multi-agent orchestration
- Airflow: Scheduled Gitea ingestion DAG
- Documentation: Complete plan and implementation guides

Architecture:
- Modular Docker Compose per service
- External ai-mesh network for communication
- Fast rebuilds with /app/packages pattern
- Intelligent agent routing (no hardcoded keywords)

Services:
- Frontend (5173): React chat UI
- Chat Gateway (8000): FastAPI entry point
- LangGraph (8090): Agent orchestration
- Knowledge (8080): ChromaDB RAG
- Airflow (8081): Scheduled ingestion
- PostgreSQL (5432): Chat history

Excludes: node_modules, .venv, chroma_db, logs, .env files
Includes: All source code, configs, docs, docker files
commit 628ba96998 (2026-02-27 19:51:06 +11:00)
44 changed files with 7177 additions and 0 deletions

plan.md

# Project Plan: aboutme_chat_demo
## Goal
Build a comprehensive AI agent system that ingests data from self-hosted services (Gitea, notes, wikis), stores it in a vector database, and provides intelligent responses through a multi-agent orchestration layer. The system emphasizes modular containerized architecture, industry-standard tools, and employment-relevant skills.
---
## Phase 1: Foundation & Core Infrastructure (COMPLETED)
### Phase 1.1: Frontend Application
**Location:** `/home/sam/development/aboutme_chat_demo/frontend/`
**Stack & Tools:**
- **Framework:** Vite 6.2.0 + React 19.0.0 + TypeScript
- **Styling:** Tailwind CSS 4.0.0
- **State Management:** TanStack Query (React Query) 5.67.0
- **Build Tool:** Vite with React plugin
- **Linting:** ESLint 9.21.0 + typescript-eslint 8.24.0
**Components Implemented:**
- `ChatInterface.tsx` - Auto-expanding text input with scrolling message list
- `App.tsx` - Main application container
- Real-time chat UI with message history
- HTTP client integration to backend gateway
**Docker Configuration:**
- Hot-reload development setup
- Volume mounting for instant code changes
- Node modules isolation (`/app/node_modules`)
### Phase 1.2: Chat Gateway (Orchestration Entry Point)
**Location:** `/home/sam/development/aboutme_chat_demo/backend/`
**Stack & Tools:**
- **Framework:** FastAPI (Python 3.11)
- **HTTP Client:** httpx 0.28.1
- **CORS:** Configured for all origins (development)
**Architecture Changes:**
- **OLD:** Hardcoded keyword matching (`["sam", "hobby", "music", "guitar", "skiing", "experience"]`) to trigger knowledge lookup
- **NEW:** Thin routing layer - all queries passed to LangGraph Supervisor for intelligent agent selection
- Removed direct Brain (LLM) integration
- Removed direct Knowledge Service calls
- Now acts as stateless entry point to LangGraph orchestration layer
**Endpoints:**
- `POST /chat` - Routes queries to LangGraph Supervisor
- `GET /health` - Service health check
- `GET /agents` - Lists available agents from LangGraph
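The gateway's thin-forwarding behaviour can be sketched with the standard library alone. The real service uses FastAPI + httpx as listed above; the `/chat` path on the LangGraph service and the `query` payload field are assumptions in this sketch, not confirmed API details.

```python
import json
import urllib.request

# Hypothetical supervisor endpoint; host/port follow the services table.
SUPERVISOR_URL = "http://langgraph-service:8090/chat"

def build_request(query: str, url: str = SUPERVISOR_URL) -> urllib.request.Request:
    """Build the POST request the gateway would forward to the supervisor."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )

def forward_to_supervisor(query: str) -> dict:
    """Forward a chat query and return the supervisor's JSON reply (network call)."""
    with urllib.request.urlopen(build_request(query)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

The point of the stateless design is visible here: the gateway holds no conversation state of its own; it only builds and relays the request.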
### Phase 1.3: Knowledge Service (Librarian Agent)
**Location:** `/home/sam/development/knowledge_service/`
**Stack & Tools:**
- **Framework:** FastAPI + Uvicorn
- **Vector Database:** ChromaDB 1.5.1
- **Embeddings:** OpenAI via OpenRouter API (text-embedding-3-small)
- **LLM Framework:** LangChain ecosystem
- langchain 1.2.10
- langchain-community 0.4.1
- langchain-core 1.2.15
- langchain-text-splitters 1.1.1
- langchain-openai
- **Document Processing:** RecursiveCharacterTextSplitter
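The recursive splitting idea behind RecursiveCharacterTextSplitter can be sketched in a few lines. This is a simplified stand-in, not LangChain's implementation; the separator order and chunk sizes are illustrative.

```python
from typing import List, Optional

def recursive_split(text: str, chunk_size: int,
                    separators: Optional[List[str]] = None) -> List[str]:
    """Split on the coarsest separator first; recurse with finer separators
    on any piece that is still too large (simplified sketch)."""
    if separators is None:
        separators = ["\n\n", "\n", " "]
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            if piece:
                chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks
```

Splitting paragraph-first keeps semantically related text together, which matters for embedding quality downstream.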
**Key Files:**
- `main.py` - FastAPI endpoints for /query and /health
- `gitea_scraper.py` - Gitea API integration module (NEW)
- `data/hobbies.md` - Sample knowledge base content
- `chroma_db/` - Persistent vector storage
**Docker Architecture (Optimized):**
- **Pattern:** Separate `/app/packages` (cached) from `/app/code` (volume-mounted)
- **Benefits:**
- Code changes apply instantly without rebuild
- Package installation happens once during image build
- PYTHONPATH=/app/packages ensures imports work
- **Volumes:**
- `./data:/app/code/data` - Knowledge documents
- `./chroma_db:/app/code/chroma_db` - Vector database persistence
- `./main.py:/app/code/main.py:ro` - Read-only code mount
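The packages/code split can be sketched as a Dockerfile. This is a hypothetical reconstruction of the pattern described above; the uvicorn entry point and requirements path are assumptions.

```dockerfile
FROM python:3.11-slim

# Packages install once, into a path outside the code mount, so
# code-only changes never invalidate this (slow) layer.
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir --target=/app/packages -r /tmp/requirements.txt

# Imports resolve from the cached package directory.
ENV PYTHONPATH=/app/packages

# Code lives in /app/code and is volume-mounted at runtime,
# e.g. ./main.py:/app/code/main.py:ro in docker-compose.
WORKDIR /app/code
CMD ["python", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
```

The trade-off is that adding a dependency still requires a rebuild; only code edits are instant.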
### Phase 1.4: LangGraph Supervisor Service (NEW)
**Location:** `/home/sam/development/langgraph_service/`
**Stack & Tools:**
- **Framework:** FastAPI + Uvicorn
- **Orchestration:** LangGraph 1.0.9
- langgraph-checkpoint 4.0.0
- langgraph-prebuilt 1.0.8
- langgraph-sdk 0.3.9
- **State Management:** TypedDict with Annotated operators
- **Message Types:** LangChain Core Messages (HumanMessage, AIMessage)
**Architecture:**
- **Supervisor Node:** Analyzes queries and routes to specialist agents
- **Agent Graph:** StateGraph with conditional edges
- **Three Specialist Agents:**
1. **Librarian Agent** - Queries ChromaDB via knowledge-service:8080
2. **Opencode Agent** - Placeholder for coding tasks (MCP integration ready)
3. **Brain Agent** - Fallback to OpenCode Brain LLM (opencode-brain:5000)
**Routing Logic:**
```
Query → Supervisor → [Librarian | Opencode | Brain]
- "repo/code/git/project" → Librarian (RAG)
- "write/edit/create/fix" → Opencode (Coding)
- "sam/hobby/music/about" → Librarian (RAG)
- Default → Brain (General LLM)
```
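The rules in the diagram can be written down literally as a routing function. This is illustrative only: the Supervisor is described elsewhere in this plan as doing query analysis rather than hardcoded keyword matching, and the agent names simply follow the diagram.

```python
def route_query(query: str) -> str:
    """Pick a specialist agent for a query, mirroring the rules above.

    Illustrative sketch; the actual Supervisor node analyzes queries
    rather than matching literal keywords.
    """
    q = query.lower()
    if any(k in q for k in ("repo", "code", "git", "project")):
        return "librarian"   # RAG over ingested repositories
    if any(k in q for k in ("write", "edit", "create", "fix")):
        return "opencode"    # coding tasks
    if any(k in q for k in ("sam", "hobby", "music", "about")):
        return "librarian"   # RAG over personal knowledge
    return "brain"           # general-purpose LLM fallback
```

In LangGraph, a function of this shape would typically be supplied to the supervisor node's conditional edges so the returned label selects the next node.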
**Docker Configuration:**
- Self-contained with own `/app/packages`
- No package sharing with other services (modular)
- Port 8090 exposed
### Phase 1.5: Apache Airflow (Scheduled Ingestion)
**Location:** `/home/sam/development/airflow/`
**Stack & Tools:**
- **Orchestration:** Apache Airflow 2.8.1
- **Executor:** CeleryExecutor (distributed task processing)
- **Database:** PostgreSQL 13 (metadata)
- **Message Queue:** Redis (Celery broker)
- **Services:**
- airflow-webserver (UI + API)
- airflow-scheduler (DAG scheduling)
- airflow-worker (task execution)
- airflow-triggerer (deferrable operators)
**DAG: gitea_daily_ingestion**
- **Schedule:** Daily
- **Tasks:**
1. `fetch_repos` - Get all user repos from Gitea API
2. `fetch_readmes` - Download README files
3. `ingest_to_chroma` - Store in Knowledge Service
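The three-task flow can be sketched as plain Python, with hypothetical stubs standing in for the real Airflow tasks (the returned values here are placeholders, not actual repo data):

```python
def fetch_repos() -> list:
    """Task 1: list user repos from the Gitea API (stubbed here)."""
    return ["aboutme_chat_demo", "knowledge_service"]

def fetch_readmes(repos: list) -> dict:
    """Task 2: download each repo's README (stubbed here)."""
    return {name: "# " + name + "\n(placeholder README)" for name in repos}

def ingest_to_chroma(readmes: dict) -> int:
    """Task 3: push documents to the Knowledge Service (stubbed here)."""
    return len(readmes)  # pretend every document was ingested

def run_daily_ingestion() -> int:
    """Chain the tasks in the same order as the gitea_daily_ingestion DAG."""
    return ingest_to_chroma(fetch_readmes(fetch_repos()))
```

In the actual DAG each function becomes a task and Airflow handles ordering, retries, and scheduling; the data flow is the same.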
**Integration:**
- Mounts `knowledge_service/gitea_scraper.py` into DAGs folder
- Environment variables for Gitea API token
- Network: ai-mesh (communicates with knowledge-service)
### Phase 1.6: Gitea Scraper Module
**Location:** `/home/sam/development/knowledge_service/gitea_scraper.py`
**Functionality:**
- **API Integration:** Gitea REST API v1
- **Authentication:** Token-based (Authorization header)
- **Methods:**
- `get_user_repos()` - Paginated repo listing
- `get_readme(repo_name)` - README content with fallback names
- `get_repo_files(repo_name, path)` - Directory listing
- `get_file_content(repo_name, filepath)` - File download
**Data Model:**
- `RepoMetadata` dataclass (name, description, url, branch, updated_at, language)
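A minimal stdlib sketch of the client shape described above. The `token` Authorization scheme and `/api/v1` prefix match Gitea's documented REST API, but treat the details as assumptions; the network call itself is omitted.

```python
import urllib.request
from dataclasses import dataclass

@dataclass
class RepoMetadata:
    """Mirrors the dataclass fields listed above."""
    name: str
    description: str
    url: str
    branch: str
    updated_at: str
    language: str

class GiteaClient:
    """Minimal sketch of token-authenticated Gitea REST API v1 access."""

    def __init__(self, base_url: str, token: str):
        self.base_url = base_url.rstrip("/")
        self.token = token

    def _request(self, path: str) -> urllib.request.Request:
        # Gitea accepts "Authorization: token <token>" headers.
        return urllib.request.Request(
            f"{self.base_url}/api/v1{path}",
            headers={"Authorization": f"token {self.token}"},
        )

    def get_user_repos(self, page: int = 1, limit: int = 50) -> urllib.request.Request:
        """Build the paginated repo-listing request (network call omitted)."""
        return self._request(f"/user/repos?page={page}&limit={limit}")
```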
### Phase 1.7: Docker Infrastructure
**Network:**
- `ai-mesh` (external) - Shared bridge network for all services
**Services Overview:**
| Service | Port | Purpose | Dependencies |
|---------|------|---------|--------------|
| frontend | 5173 | React UI | backend |
| backend | 8000 | Chat Gateway | langgraph-service, db |
| db | 5432 | PostgreSQL (chat history) | - |
| knowledge-service | 8080 | RAG / Vector DB | - |
| langgraph-service | 8090 | Agent Orchestration | knowledge-service |
| airflow-webserver | 8081 | Workflow UI | postgres, redis |
| airflow-scheduler | - | DAG scheduling | postgres, redis |
| airflow-worker | - | Task execution | postgres, redis |
| redis | 6379 | Message broker | - |
| postgres (airflow) | - | Airflow metadata | - |
**Container Patterns:**
- All Python services use `/app/packages` + `/app/code` separation
- Node.js services use volume mounting for hot reload
- PostgreSQL uses named volumes for persistence
- External network (`ai-mesh`) for cross-service communication
---
## Phase 2: Multi-Source Knowledge Ingestion (IN PROGRESS)
### Goal
Expand beyond Gitea to ingest data from all self-hosted knowledge sources.
### Data Sources to Integrate:
1. **Notes & Documentation**
- **Trilium Next** - Hierarchical note-taking (tree structure)
- **Obsidian** - Markdown vault with backlinks
- **Flatnotes** - Flat file markdown notes
- **HedgeDoc** - Collaborative markdown editor
2. **Wiki**
- **DokuWiki** - Structured wiki content
3. **Project Management**
- **Vikunja** - Task lists and project tracking
4. **Media & Assets**
- **Immich** - Photo/video metadata + Gemini Vision API for content description
- **HomeBox** - Physical inventory with images
### Technical Approach:
- **Crawling:** Selenium/Playwright for JavaScript-heavy UIs
- **Extraction:** Firecrawl or LangChain loaders for structured content
- **Vision:** Gemini Vision API for image-to-text conversion
- **Storage:** ChromaDB (vectors) + PostgreSQL (metadata, hashes for deduplication)
- **Scheduling:** Additional Airflow DAGs per source
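The hash-based deduplication mentioned in the storage bullet can be sketched as follows. The plan keeps hashes in PostgreSQL; an in-memory set stands in for that table here.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint for a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

class DedupStore:
    """Sketch of hash-based dedup; a set stands in for the PostgreSQL table."""

    def __init__(self):
        self.seen = set()

    def should_ingest(self, text: str) -> bool:
        """True only the first time a given content payload is seen."""
        h = content_hash(text)
        if h in self.seen:
            return False
        self.seen.add(h)
        return True
```

Comparing hashes rather than full documents lets nightly re-scans skip unchanged READMEs and notes cheaply, so only genuinely new or edited content reaches the embedding step.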
---
## Phase 3: Advanced Agent Capabilities
### Goal
Integrate external AI tools and expand agent capabilities.
### Agent Tooling:
1. **MCP (Model Context Protocol) Servers**
- Git MCP - Local repository operations
- Filesystem MCP - Secure file access
- Memory MCP - Knowledge graph persistence
- Custom Gitea MCP (if/when available)
2. **External Agents**
- **Goose** - CLI-based agent for local task execution
- **Aider** - AI pair programming
- **Opencode** - Already integrated (Brain Agent)
- **Automaker** - Workflow automation
- **Autocoder** - Code generation
3. **Orchestration Tools**
- **CAO CLI** - Agent orchestrator
- **Agent Pipe** - Pipeline management
### Integration Pattern:
- Each external tool wrapped as LangGraph node
- Supervisor routes to appropriate specialist
- State management for multi-turn interactions
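Wrapping a CLI tool as a node can be sketched like this. The state-in/state-out shape is hypothetical (it is not LangGraph's actual node API), and the wrapped command would be whichever CLI agent (Goose, Aider, etc.) is being integrated.

```python
import subprocess

def make_cli_node(command: list):
    """Wrap an external CLI tool as a graph-node-style callable:
    state dict in, updated state dict out (hypothetical shape)."""

    def node(state: dict) -> dict:
        # Feed the latest message to the tool via stdin and append
        # its stdout as the agent's reply.
        prompt = state["messages"][-1]
        result = subprocess.run(
            command, input=prompt, capture_output=True, text=True, check=True
        )
        return {"messages": state["messages"] + [result.stdout.strip()]}

    return node
```

Because each tool is isolated behind the same callable shape, the Supervisor can route to it like any other specialist without knowing it shells out underneath.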
---
## Phase 4: Production Hardening
### Goal
Prepare system for production deployment.
### Authentication & Security:
- **Laravel** - User authentication service (Phase 4 original plan)
- **JWT tokens** - Session management
- **API key management** - Secure credential storage
- **Network policies** - Inter-service communication restrictions
### Monitoring & Observability:
- **LangSmith** - LLM tracing and debugging
- **Langfuse** - LLM observability (note: currently in per-project install list)
- **Prometheus/Grafana** - Metrics and dashboards
- **Airflow monitoring** - DAG success/failure alerting
### Scaling:
- **ChromaDB** - Migration to server mode for concurrent access
- **Airflow** - Multiple Celery workers
- **Load balancing** - Nginx reverse proxy
- **Backup strategies** - Vector DB snapshots, PostgreSQL dumps
---
## Phase 5: Workflow Automation & Visual Tools
### Goal
Add visual prototyping and automation capabilities.
### Tools to Integrate:
1. **Flowise** - Visual LangChain builder
- Prototype agent flows without coding
- Export to Python code
- Debug RAG pipelines visually
2. **Windmill** - Turn scripts into workflows
- Schedule Python/LangChain scripts
- Reactive triggers (e.g., on-commit)
- Low-code workflow builder
3. **Activepieces** - Event-driven automation
- Webhook triggers from Gitea
- Integration with external APIs
- Visual workflow designer
4. **N8N** - Alternative workflow automation
- Consider if Activepieces doesn't meet needs
### Use Cases:
- **On-commit triggers:** Gitea push → immediate re-scan → notification
- **Scheduled reports:** Weekly summary of new/updated projects
- **Reactive workflows:** New photo uploaded → Gemini Vision → update knowledge base
---
## Phase 6: Knowledge Library Options & RAG Enhancement
### Goal
Advanced retrieval and knowledge organization.
### RAG Pipeline Improvements:
1. **Hybrid Search**
- Semantic search (ChromaDB) + Keyword search (PostgreSQL)
- Re-ranking with cross-encoders
- Query expansion and decomposition
2. **Multi-Modal RAG**
- Image retrieval (Immich + CLIP embeddings)
- Document parsing (PDFs, code files)
- Structured data (tables, lists)
3. **Knowledge Organization**
- Entity extraction and linking
- Knowledge graph construction
- Hierarchical chunking strategies
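One common way to merge the semantic and keyword result lists is reciprocal rank fusion, sketched below. RRF is a named technique offered here as an option, not something the plan commits to; `k = 60` is the conventional constant from the RRF literature.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Merge ranked result lists (e.g. semantic + keyword) by RRF score:
    each document scores sum(1 / (k + rank)) over the lists containing it."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs only ranks, not scores, so it sidesteps the problem of ChromaDB distances and PostgreSQL full-text scores living on incomparable scales.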
### Alternative Vector Stores (Evaluation):
- **pgvector** - PostgreSQL native (if ChromaDB's limitations become a problem)
- **Weaviate** - GraphQL interface, hybrid search
- **Qdrant** - Rust-based, high performance
- **Milvus** - Enterprise-grade, distributed
---
## Phase 7: User Experience & Interface
### Goal
Enhanced frontend and interaction patterns.
### Frontend Enhancements:
1. **Chat Interface Improvements**
- Streaming responses (Server-Sent Events)
- Message threading and context
- File upload for document ingestion
- Image display (for Immich integration)
2. **Knowledge Browser**
- View ingested documents
- Search knowledge base directly
- See confidence scores and sources
- Manual document upload/ingestion trigger
3. **Agent Management**
- View active agents
- Configure agent behavior
- Monitor agent performance
- Override routing decisions
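The SSE wire format behind the streaming-responses bullet is simple enough to sketch. Field names like `token` are hypothetical; the frame layout follows the Server-Sent Events specification.

```python
import json

def format_sse(data: dict, event: str = None) -> str:
    """Serialize one Server-Sent Events frame: an optional event name,
    a data line, and the blank line that terminates the frame."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"
```

A FastAPI endpoint would yield frames like these from a generator (e.g. via a streaming response), and the React client would consume them with `EventSource` or a fetch reader.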
### Mobile & Accessibility:
- Responsive design improvements
- Mobile app (React Native or PWA)
- Accessibility compliance (WCAG)
---
## Technology Stack Summary
### Core Frameworks:
- **Backend:** FastAPI (Python 3.11)
- **Frontend:** Vite + React 19 + TypeScript
- **Styling:** Tailwind CSS
- **Database:** PostgreSQL 15
- **Vector DB:** ChromaDB 1.5.1
### AI/ML Stack:
- **LLM Orchestration:** LangGraph 1.0.9 + LangChain
- **Embeddings:** OpenAI via OpenRouter (text-embedding-3-small)
- **LLM:** OpenCode Brain (opencode-brain:5000)
- **Vision:** Gemini Vision API (Phase 2)
### Workflow & Scheduling:
- **Orchestration:** Apache Airflow 2.8.1 (CeleryExecutor)
- **Message Queue:** Redis
- **External Tools:** Flowise, Windmill, Activepieces
### Development Tools:
- **Containers:** Docker + Docker Compose
- **Networking:** Bridge network (ai-mesh)
- **Testing:** curl/httpx for API testing
- **Version Control:** Gitea (self-hosted)
### Skills Demonstrated:
- Containerized microservices architecture
- Multi-agent AI orchestration (LangGraph)
- Vector database implementation (RAG)
- ETL pipeline development (Airflow)
- API integration and web scraping
- Modular, maintainable code organization
- Industry-standard AI tooling (LangChain ecosystem)
- Workflow automation and scheduling