From d3ce7f12deace5bfb388769b1dca0ea197698dd2 Mon Sep 17 00:00:00 2001
From: Sam Rolfe
Date: Mon, 8 Jun 2026 11:37:31 +1000
Subject: [PATCH] skill: add image analysis with Qwen 2.5 VL via OpenRouter

---
 skills/markitdown/SKILL.md | 57 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 56 insertions(+), 1 deletion(-)

diff --git a/skills/markitdown/SKILL.md b/skills/markitdown/SKILL.md
index 91304ea..cd8c44e 100644
--- a/skills/markitdown/SKILL.md
+++ b/skills/markitdown/SKILL.md
@@ -31,7 +31,7 @@ Then recreate the wrapper at `~/.local/bin/markitdown`.
 | Word | `.docx` | lxml, mammoth |
 | PowerPoint | `.pptx` | python-pptx |
 | Excel | `.xlsx`, `.xls` | openpyxl, pandas, xlrd |
-| Images | `.jpg`, `.png`, etc. | EXIF metadata (core); OCR needs separate `markitdown-ocr` plugin |
+| Images | `.jpg`, `.png`, etc. | EXIF metadata (core); LLM vision via `llm_client`/`llm_model`; OCR via `markitdown-ocr` plugin (installed) |
 | HTML | `.html`, `.htm` | beautifulsoup4 (core) |
 | CSV | `.csv` | (core) |
 | JSON | `.json` | (core) |
@@ -79,9 +79,64 @@ This is especially useful for:
 - Images with EXIF metadata
 - Any file format not directly readable by the `read` tool
 
+## Image Analysis (LLM Vision)
+
+For images, markitdown can extract EXIF metadata (free, no API key) AND describe image content using an LLM vision model.
+
+**EXIF only (already works):**
+```bash
+markitdown photo.jpg
+```
+
+**With LLM vision — requires OpenRouter API key:**
+
+Set environment variable:
+```bash
+export OPENROUTER_API_KEY=sk-or-v1-...
+```
+
+Then use the wrapper:
+```bash
+markitdown-vision photo.jpg
+```
+
+Or use Python API directly:
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+import os
+
+client = OpenAI(
+    base_url="https://openrouter.ai/api/v1",
+    api_key=os.environ["OPENROUTER_API_KEY"],
+)
+
+md = MarkItDown(
+    llm_client=client,
+    llm_model="qwen/qwen2.5-vl-72b-instruct",
+)
+
+result = md.convert("photo.jpg")
+print(result.text_content)
+```
+
+**Why Qwen 2.5 VL 72B?**
+- Excellent vision understanding
+- Affordable: ~$0.25/M input, ~$0.75/M output tokens
+- 131K context window
+- Available on OpenRouter
+
+**OCR inside documents (installed):**
+`markitdown-ocr` plugin is installed. Enable with:
+```python
+md = MarkItDown(enable_plugins=True, llm_client=client, llm_model="qwen/qwen2.5-vl-72b-instruct")
+```
+This extracts text from images embedded in PDFs, Word, PowerPoint, and Excel files.
+
 ## Security Notes
 
 - MarkItDown performs I/O with current process privileges
 - Sanitize inputs in untrusted environments
 - Only convert files from trusted sources
 - The `[pdf,docx,pptx,xlsx]` extras are installed; audio transcription and Azure AI are NOT installed
+- Image analysis requires an OpenRouter API key — costs tokens per image