Publish local repo state

This commit is contained in:
2026-02-08 13:09:33 +11:00
parent 1fb54400b4
commit de91475de5
33 changed files with 599 additions and 56064 deletions

ai_dev_plan.md Normal file
@@ -0,0 +1,75 @@
# Personal AI Agent: "About Me" Profile Generator
**Project Goal**
Build a showcase AI system that scans and summarizes your professional/personal work from self-hosted services (primarily Gitea for code/repos, plus Flatnotes/Trillium/HedgeDoc for notes/ideas/projects). The agent answers employer-style questions dynamically (e.g., "Summarize Giordano's coding projects and skills") with RAG-grounded responses, links, and image embeds where relevant.
Emphasize broad AI toolchain integration for skill development and portfolio impact: agentic workflows, RAG pipelines, orchestration, multi-LLM support. No frontend focus — terminal/API-triggered queries only.
**Key Features**
- Periodic/full scanning of services to extract text, summaries, code snippets, links, images.
- Populate & query a local vector DB (RAG) for semantic search.
- Agent reasons, retrieves, generates responses with evidence (links/images).
- Multi-LLM fallback (DeepSeek primary, Gemini/OpenCode trigger).
- Scheduled/automated updates via pipelines.
- Local/Docker deployment for privacy & control.
**Tools & Stack Overview**
| Category | Tool(s) | Purpose & Why Chosen | Integration Role |
|-----------------------|----------------------------------|--------------------------------------------------------------------------------------|------------------|
| Core Framework | LangChain / LangGraph | Build agent, tools, chains, RAG logic. Modular, industry-standard for LLM apps. | Heart of agent & retrieval |
| Crawling/Extraction | Selenium / Playwright + Firecrawl (via LangChain loaders) | Handle auth/dynamic pages (Gitea login/nav), structured extraction (Markdown/JSON). | Scan web views & APIs |
| Vector Database | Chroma | Local, lightweight RAG store. Easy Docker setup, native LangChain integration. | Store embeddings for fast semantic search |
| LLM(s) | DeepSeek (via API) + Gemini / OpenCode | DeepSeek: cheap, strong reasoning (primary). Gemini/OpenCode: terminal trigger/fallback. | Reasoning & generation |
| Data Pipeline / Scheduling | Apache Airflow (Docker) | Industry-best for ETL/ETL-like scans (DAGs). Local install via official Compose. | Schedule periodic scans/updates to Chroma |
| Visual Prototyping | Flowise | No-code visual builder on LangChain. Quick agent/RAG prototyping & debugging. | Experiment with chains before code |
| Script/Workflow Orchestration | Windmill | Turn Python/LangChain scripts into reusable, scheduled flows. Dev-first, high growth.| Reactive workflows (e.g., on-commit triggers) |
| Event-Driven Automation | Activepieces | Connect services event-based (e.g., Gitea webhook → re-scan). AI-focused pieces. | Glue for reactive triggers |
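The multi-LLM row above (DeepSeek primary, Gemini fallback) reduces to a thin wrapper that tries each provider in order and returns the first success. A minimal sketch; the `deepseek`/`gemini` callables here are hypothetical stand-ins for the real API clients:

```python
from typing import Callable, Sequence


def complete_with_fallback(prompt: str,
                           providers: Sequence[Callable[[str], str]]) -> str:
    """Try each LLM provider in order; return the first successful answer.

    Each provider is any callable taking a prompt and returning text
    (e.g. a DeepSeek client first, then a Gemini client as fallback).
    """
    last_error: Exception | None = None
    for provider in providers:
        try:
            return provider(prompt)
        except Exception as exc:  # unavailable, rate-limited, timed out, ...
            last_error = exc
    raise RuntimeError("all providers failed") from last_error


# Hypothetical stand-ins for real API clients:
def deepseek(prompt: str) -> str:
    raise TimeoutError("primary provider down")


def gemini(prompt: str) -> str:
    return f"[gemini] {prompt}"
```

Keeping providers behind one callable signature means the agent code never needs to know which backend answered.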
**High-Level Architecture & Flow**
1. **Ingestion Pipeline (Airflow + Crawlers)**
- Airflow DAG runs on schedule (daily/weekly) or manually.
- Task 1: LangChain agent uses Selenium/Playwright tool to browse/authenticate to services (e.g., Gitea repos, Flatnotes/Trillium pages).
- Task 2: Firecrawl loader extracts structured content (text, code blocks, links, image URLs).
- Task 3: LangChain chunks, embeds (DeepSeek embeddings), upserts to Chroma vector DB.
- Optional: Activepieces listens for events (e.g., Gitea push webhook) → triggers partial re-scan.
2. **Agent Runtime (LangChain/LangGraph + DeepSeek)**
- Core agent (ReAct-style): Receives query (e.g., via terminal/OpenCode: "opencode query 'Giordano's top projects'").
- Tools: Retrieve from Chroma (RAG), fetch specific pages/images if needed.
- LLM: DeepSeek for cost-effective reasoning/summarization. Fallback to Gemini if complex.
- Output: Natural response with summaries, links (e.g., Gitea repo URLs), embedded image previews (from scanned pages).
3. **Prototyping & Orchestration Layer**
- Use Flowise to visually build/test agent chains/RAG flows before committing to code.
- Windmill wraps scripts (e.g., scan script) as jobs/APIs.
- Activepieces adds event-driven glue (e.g., new note in Trillium → notify/update DB).
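The retrieve step in flow (2) is nearest-neighbour search over embedded chunks. A toy stand-in for what Chroma does internally, with hand-written 3-dimensional vectors in place of real embedding-model output:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieve(query_vec: list[float],
             store: list[tuple[str, list[float]]],
             k: int = 2) -> list[str]:
    """Rank stored (text, vector) chunks by similarity to the query,
    as a vector DB like Chroma does, and return the top-k texts."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy corpus; real vectors would come from the embedding model.
store = [
    ("Gitea repo: rag-agent", [0.9, 0.1, 0.0]),
    ("Flatnotes: holiday ideas", [0.0, 0.2, 0.9]),
    ("HedgeDoc: LangChain design doc", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]  # pretend embedding of "coding projects"
print(retrieve(query, store))
# → ['Gitea repo: rag-agent', 'HedgeDoc: LangChain design doc']
```

The retrieved texts (with their source links) are then stuffed into the LLM prompt as grounding evidence.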
**Deployment & Running Locally**
- Everything in Docker Compose: Airflow (official image), Chroma, Python services (LangChain agent), optional Flowise/Windmill containers.
- Secrets: Env vars for API keys (DeepSeek, service auth).
- Trigger: Terminal via OpenCode/Gemini CLI → calls agent endpoint/script.
- Scale: Start simple (manual scans), add Airflow scheduling later.
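The Compose layout above could start as small as this sketch (service names, image tags, and env var names are illustrative; Airflow's official stack adds several more services):

```yaml
services:
  chroma:
    image: chromadb/chroma
    ports: ["8000:8000"]
    volumes: ["chroma-data:/chroma/chroma"]
  agent:
    build: ./agent            # LangChain agent + crawler code (hypothetical path)
    environment:
      - DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY}
      - CHROMA_HOST=chroma
    depends_on: [chroma]
volumes:
  chroma-data:
```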
**Skill Showcase & Portfolio Value**
- Demonstrates: Agentic AI, RAG pipelines, web crawling with auth, multi-tool orchestration, cost-optimized LLMs, local/self-hosted infra.
- Broad coverage: LangChain ecosystem + industry ETL (Airflow) + modern AI workflow tools (Flowise/Windmill/Activepieces).
- Low cost: DeepSeek keeps API bills minimal (often <$5/month even with frequent scans/queries).
**Next Steps (Implementation Phases)**
1. Setup local Docker env + Chroma + DeepSeek API key.
2. Build basic crawler tools (Selenium + Firecrawl) for Gitea/Flatnotes.
3. Prototype agent in Flowise, then code in LangChain.
4. Add Airflow DAG for scheduled ingestion.
5. Integrate Windmill/Activepieces for extras.
6. Test queries, refine summaries/links/images.
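For phase 2/4, crawler output has to be chunked before embedding. A plain-Python sketch of the overlapping-window idea behind LangChain's text splitters (the real `RecursiveCharacterTextSplitter` also respects paragraph and sentence boundaries):

```python
def chunk(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size windows that overlap by `overlap`
    characters, so context straddling a boundary survives in some chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

Each chunk (plus metadata like its source URL) is what gets embedded and upserted into Chroma.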
This setup positions you strongly for AI engineering roles while building real, integrated skills.
**Extra Tools to Add**
- AutoMaker
- AutoCoder - Together with AutoMaker, assists with set-and-forget, long-running AI review.
- OpenRouter - Single access point to many models from any CLI, with usage-based fees.
- Aider - CLI code and file editing, pointed at OpenRouter to use any model.
- Goose - Integrates with the system and with MCP servers like ClawBot.