skill: add image analysis with Qwen 2.5 VL via OpenRouter

2026-06-08 11:37:31 +10:00
parent 1619206398
commit d3ce7f12de
1 changed file with 56 additions and 1 deletion
--- a/skills/markitdown/SKILL.md
+++ b/skills/markitdown/SKILL.md
@@ -31,7 +31,7 @@ Then recreate the wrapper at `~/.local/bin/markitdown`.
 | Word | `.docx` | lxml, mammoth |
 | PowerPoint | `.pptx` | python-pptx |
 | Excel | `.xlsx`, `.xls` | openpyxl, pandas, xlrd |
-| Images | `.jpg`, `.png`, etc. | EXIF metadata (core); OCR needs separate `markitdown-ocr` plugin |
+| Images | `.jpg`, `.png`, etc. | EXIF metadata (core); LLM vision via `llm_client`/`llm_model`; OCR via `markitdown-ocr` plugin (installed) |
 | HTML | `.html`, `.htm` | beautifulsoup4 (core) |
 | CSV | `.csv` | (core) |
 | JSON | `.json` | (core) |
@@ -79,9 +79,64 @@ This is especially useful for:
 - Images with EXIF metadata
 - Any file format not directly readable by the `read` tool

+## Image Analysis (LLM Vision)
+
+For images, markitdown can extract EXIF metadata (free, no API key needed) and can also describe image content using an LLM vision model.
+
+**EXIF only (already works):**
+```bash
+markitdown photo.jpg
+```
+
+**With LLM vision (requires an OpenRouter API key):**
+
+Set the environment variable:
+```bash
+export OPENROUTER_API_KEY=sk-or-v1-...
+```
+
+Then use the wrapper:
+```bash
+markitdown-vision photo.jpg
+```
+
+Or use the Python API directly:
+```python
+from markitdown import MarkItDown
+from openai import OpenAI
+import os
+
+client = OpenAI(
+    base_url="https://openrouter.ai/api/v1",
+    api_key=os.environ["OPENROUTER_API_KEY"],
+)
+
+md = MarkItDown(
+    llm_client=client,
+    llm_model="qwen/qwen2.5-vl-72b-instruct",
+)
+
+result = md.convert("photo.jpg")
+print(result.text_content)
+```
+
+**Why Qwen 2.5 VL 72B?**
+- Excellent vision understanding
+- Affordable: ~$0.25/M input, ~$0.75/M output tokens
+- 131K context window
+- Available on OpenRouter
+
+**OCR inside documents (installed):**
+The `markitdown-ocr` plugin is installed. Enable it with:
+```python
+md = MarkItDown(enable_plugins=True, llm_client=client, llm_model="qwen/qwen2.5-vl-72b-instruct")
+```
+This extracts text from images embedded in PDF, Word, PowerPoint, and Excel files.
+
 ## Security Notes

 - MarkItDown performs I/O with current process privileges
 - Sanitize inputs in untrusted environments
 - Only convert files from trusted sources
 - The `[pdf,docx,pptx,xlsx]` extras are installed; audio transcription and Azure AI are NOT installed
+- LLM image analysis requires an OpenRouter API key and consumes paid tokens per image; EXIF extraction does not