Publish local repo state

This commit is contained in:
2026-02-08 13:09:33 +11:00
parent 1fb54400b4
commit de91475de5
33 changed files with 599 additions and 56064 deletions

ai_dev_plan.md Normal file
@@ -0,0 +1,75 @@
# Personal AI Agent: "About Me" Profile Generator
**Project Goal**
Build a showcase AI system that scans and summarizes your professional/personal work from self-hosted services (primarily Gitea for code/repos, plus Flatnotes/Trillium/HedgeDoc for notes/ideas/projects). The agent answers employer-style questions dynamically (e.g., "Summarize Giordano's coding projects and skills") with RAG-grounded responses, links, and image embeds where relevant.
Emphasize broad AI toolchain integration for skill development and portfolio impact: agentic workflows, RAG pipelines, orchestration, multi-LLM support. No frontend focus — terminal/API-triggered queries only.
**Key Features**
- Periodic/full scanning of services to extract text, summaries, code snippets, links, images.
- Populate & query a local vector DB (RAG) for semantic search.
- Agent reasons, retrieves, generates responses with evidence (links/images).
- Multi-LLM fallback (DeepSeek primary, Gemini/OpenCode trigger).
- Scheduled/automated updates via pipelines.
- Local/Docker deployment for privacy & control.
**Tools & Stack Overview**
| Category | Tool(s) | Purpose & Why Chosen | Integration Role |
|-----------------------|----------------------------------|--------------------------------------------------------------------------------------|------------------|
| Core Framework | LangChain / LangGraph | Build agent, tools, chains, RAG logic. Modular, industry-standard for LLM apps. | Heart of agent & retrieval |
| Crawling/Extraction | Selenium / Playwright + Firecrawl (via LangChain loaders) | Handle auth/dynamic pages (Gitea login/nav), structured extraction (Markdown/JSON). | Scan web views & APIs |
| Vector Database | Chroma | Local, lightweight RAG store. Easy Docker setup, native LangChain integration. | Store embeddings for fast semantic search |
| LLM(s) | DeepSeek (via API) + Gemini / OpenCode | DeepSeek: cheap, strong reasoning (primary). Gemini/OpenCode: terminal trigger/fallback. | Reasoning & generation |
| Data Pipeline / Scheduling | Apache Airflow (Docker) | Industry-best for ETL/ETL-like scans (DAGs). Local install via official Compose. | Schedule periodic scans/updates to Chroma |
| Visual Prototyping | Flowise | No-code visual builder on LangChain. Quick agent/RAG prototyping & debugging. | Experiment with chains before code |
| Script/Workflow Orchestration | Windmill | Turn Python/LangChain scripts into reusable, scheduled flows. Dev-first, high growth.| Reactive workflows (e.g., on-commit triggers) |
| Event-Driven Automation | Activepieces | Connect services event-based (e.g., Gitea webhook → re-scan). AI-focused pieces. | Glue for reactive triggers |
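The multi-LLM row above (DeepSeek primary, Gemini fallback) reduces to a thin wrapper that tries each provider in order and returns the first success. A minimal sketch; the `deepseek`/`gemini` callables here are hypothetical stand-ins for the real API clients:

```python
from typing import Callable, Sequence


def complete_with_fallback(prompt: str,
                           providers: Sequence[Callable[[str], str]]) -> str:
    """Try each LLM provider in order; return the first successful answer.

    Each provider is any callable taking a prompt and returning text
    (e.g. a DeepSeek client first, then a Gemini client as fallback).
    """
    last_error: Exception | None = None
    for provider in providers:
        try:
            return provider(prompt)
        except Exception as exc:  # unavailable, rate-limited, timed out, ...
            last_error = exc
    raise RuntimeError("all providers failed") from last_error


# Hypothetical stand-ins for real API clients:
def deepseek(prompt: str) -> str:
    raise TimeoutError("primary provider down")


def gemini(prompt: str) -> str:
    return f"[gemini] {prompt}"
```

Keeping providers behind one callable signature means the agent code never needs to know which backend answered.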
**High-Level Architecture & Flow**
1. **Ingestion Pipeline (Airflow + Crawlers)**
- Airflow DAG runs on schedule (daily/weekly) or manually.
- Task 1: LangChain agent uses Selenium/Playwright tool to browse/authenticate to services (e.g., Gitea repos, Flatnotes/Trillium pages).
- Task 2: Firecrawl loader extracts structured content (text, code blocks, links, image URLs).
- Task 3: LangChain chunks, embeds (DeepSeek embeddings), upserts to Chroma vector DB.
- Optional: Activepieces listens for events (e.g., Gitea push webhook) → triggers partial re-scan.
2. **Agent Runtime (LangChain/LangGraph + DeepSeek)**
- Core agent (ReAct-style): Receives query (e.g., via terminal/OpenCode: "opencode query 'Giordano's top projects'").
- Tools: Retrieve from Chroma (RAG), fetch specific pages/images if needed.
- LLM: DeepSeek for cost-effective reasoning/summarization. Fallback to Gemini if complex.
- Output: Natural response with summaries, links (e.g., Gitea repo URLs), embedded image previews (from scanned pages).
3. **Prototyping & Orchestration Layer**
- Use Flowise to visually build/test agent chains/RAG flows before committing to code.
- Windmill wraps scripts (e.g., scan script) as jobs/APIs.
- Activepieces adds event-driven glue (e.g., new note in Trillium → notify/update DB).
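The retrieve step in flow (2) is nearest-neighbour search over embedded chunks. A toy stand-in for what Chroma does internally, with hand-written 3-dimensional vectors in place of real embedding-model output:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieve(query_vec: list[float],
             store: list[tuple[str, list[float]]],
             k: int = 2) -> list[str]:
    """Rank stored (text, vector) chunks by similarity to the query,
    as a vector DB like Chroma does, and return the top-k texts."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy corpus; real vectors would come from the embedding model.
store = [
    ("Gitea repo: rag-agent", [0.9, 0.1, 0.0]),
    ("Flatnotes: holiday ideas", [0.0, 0.2, 0.9]),
    ("HedgeDoc: LangChain design doc", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]  # pretend embedding of "coding projects"
print(retrieve(query, store))
# → ['Gitea repo: rag-agent', 'HedgeDoc: LangChain design doc']
```

The retrieved texts (with their source links) are then stuffed into the LLM prompt as grounding evidence.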
**Deployment & Running Locally**
- Everything in Docker Compose: Airflow (official image), Chroma, Python services (LangChain agent), optional Flowise/Windmill containers.
- Secrets: Env vars for API keys (DeepSeek, service auth).
- Trigger: Terminal via OpenCode/Gemini CLI → calls agent endpoint/script.
- Scale: Start simple (manual scans), add Airflow scheduling later.
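The Compose layout above could start as small as this sketch (service names, image tags, and env var names are illustrative; Airflow's official stack adds several more services):

```yaml
services:
  chroma:
    image: chromadb/chroma
    ports: ["8000:8000"]
    volumes: ["chroma-data:/chroma/chroma"]
  agent:
    build: ./agent            # LangChain agent + crawler code (hypothetical path)
    environment:
      - DEEPSEEK_API_KEY=${DEEPSEEK_API_KEY}
      - CHROMA_HOST=chroma
    depends_on: [chroma]
volumes:
  chroma-data:
```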
**Skill Showcase & Portfolio Value**
- Demonstrates: Agentic AI, RAG pipelines, web crawling with auth, multi-tool orchestration, cost-optimized LLMs, local/self-hosted infra.
- Broad coverage: LangChain ecosystem + industry ETL (Airflow) + modern AI workflow tools (Flowise/Windmill/Activepieces).
- Low cost: DeepSeek keeps API bills minimal (often <$5/month even with frequent scans/queries).
**Next Steps (Implementation Phases)**
1. Setup local Docker env + Chroma + DeepSeek API key.
2. Build basic crawler tools (Selenium + Firecrawl) for Gitea/Flatnotes.
3. Prototype agent in Flowise, then code in LangChain.
4. Add Airflow DAG for scheduled ingestion.
5. Integrate Windmill/Activepieces for extras.
6. Test queries, refine summaries/links/images.
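For phase 2/4, crawler output has to be chunked before embedding. A plain-Python sketch of the overlapping-window idea behind LangChain's text splitters (the real `RecursiveCharacterTextSplitter` also respects paragraph and sentence boundaries):

```python
def chunk(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size windows that overlap by `overlap`
    characters, so context straddling a boundary survives in some chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

Each chunk (plus metadata like its source URL) is what gets embedded and upserted into Chroma.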
This setup positions you strongly for AI engineering roles while building real, integrated skills.
**Extra Tools to Add**
- AutoMaker
- AutoCoder - Together with AutoMaker, assists with set-and-forget, long-running AI review.
- OpenRouter - Single access point to many models from any CLI, with usage-based fees.
- Aider - CLI code and file editing, pointed at OpenRouter to use any model.
- Goose - Integrates with the system and with MCP servers like ClawBot.