skill: add image analysis with Qwen 2.5 VL via OpenRouter

2026-06-08 11:37:31 +10:00
parent 1619206398
commit d3ce7f12de
1 changed file with 56 additions and 1 deletion
--- a/skills/markitdown/SKILL.md
+++ b/skills/markitdown/SKILL.md
@@ -31,7 +31,7 @@ Then recreate the wrapper at `~/.local/bin/markitdown`.
 | Word | `.docx` | lxml, mammoth |
 | PowerPoint | `.pptx` | python-pptx |
 | Excel | `.xlsx`, `.xls` | openpyxl, pandas, xlrd |
-| Images | `.jpg`, `.png`, etc. | EXIF metadata (core); OCR needs separate `markitdown-ocr` plugin |
+| Images | `.jpg`, `.png`, etc. | EXIF metadata (core); LLM vision via `llm_client`/`llm_model`; OCR via `markitdown-ocr` plugin (installed) |
 | HTML | `.html`, `.htm` | beautifulsoup4 (core) |
 | CSV | `.csv` | (core) |
 | JSON | `.json` | (core) |
@@ -79,9 +79,64 @@ This is especially useful for:
 - Images with EXIF metadata
 - Any file format not directly readable by the `read` tool
 ## Image Analysis (LLM Vision)
 For images, markitdown can extract EXIF metadata (free, no API key) and can also describe image content using an LLM vision model.
 **EXIF only (already works):**
 ```bash
 markitdown photo.jpg
 ```
 **With LLM vision (requires an OpenRouter API key):**
 Set the environment variable:
 ```bash
 export OPENROUTER_API_KEY=sk-or-v1-...
 ```
 Then use the wrapper:
 ```bash
 markitdown-vision photo.jpg
 ```
 Or use the Python API directly:
 ```python
 from markitdown import MarkItDown
 from openai import OpenAI
 import os
 client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
 )
 md = MarkItDown(
    llm_client=client,
    llm_model="qwen/qwen2.5-vl-72b-instruct",
 )
 result = md.convert("photo.jpg")
 print(result.text_content)
 ```
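If the key may be absent, one option is to fall back to EXIF-only conversion instead of failing. The sketch below assumes that behaviour is wanted. `vision_kwargs` and `convert_image` are hypothetical helper names, not part of markitdown, and the imports are deferred so the no-key path does not need the `openai` package.

```python
import os

MODEL = "qwen/qwen2.5-vl-72b-instruct"


def vision_kwargs(env=os.environ):
    """Return MarkItDown keyword arguments for LLM vision, or {} if no key is set.

    Hypothetical helper: an empty dict means the caller gets EXIF-only output.
    """
    api_key = env.get("OPENROUTER_API_KEY")
    if not api_key:
        return {}
    # Imported lazily so the EXIF-only path works without the openai package.
    from openai import OpenAI

    client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=api_key)
    return {"llm_client": client, "llm_model": MODEL}


def convert_image(path, env=os.environ):
    """Convert an image, adding an LLM description only when a key is available."""
    from markitdown import MarkItDown

    md = MarkItDown(**vision_kwargs(env))
    return md.convert(path).text_content
```

Calling `convert_image("photo.jpg")` should then behave like the example above when `OPENROUTER_API_KEY` is set, and like plain `markitdown photo.jpg` otherwise.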
 **Why Qwen 2.5 VL 72B?**
 - Excellent vision understanding
 - Affordable: ~$0.25 per million input tokens, ~$0.75 per million output tokens
 - 131K context window
 - Available on OpenRouter
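As a rough guide to what those rates mean per image, the arithmetic can be sketched as below. The rates are the approximate figures listed above; the token counts in the usage note are illustrative assumptions, not measurements, and `estimate_cost_usd` is a hypothetical helper.

```python
# Approximate rates taken from the list above (USD per million tokens).
INPUT_USD_PER_M = 0.25
OUTPUT_USD_PER_M = 0.75


def estimate_cost_usd(input_tokens, output_tokens):
    """Estimate the cost of one request from its input and output token counts."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000
```

For example, if an image plus prompt came to 1,500 input tokens and the description to 300 output tokens (assumed numbers), `estimate_cost_usd(1500, 300)` gives roughly $0.0006 per image. Actual token usage depends on image size and the provider's accounting.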
 **OCR inside documents (installed):**
 The `markitdown-ocr` plugin is installed. Enable it with:
 ```python
 # Reuses the OpenRouter `client` created in the Python API example above.
 md = MarkItDown(enable_plugins=True, llm_client=client, llm_model="qwen/qwen2.5-vl-72b-instruct")
 ```
 This extracts text from images embedded in PDF, Word, PowerPoint, and Excel files.
 ## Security Notes
 - MarkItDown performs I/O with current process privileges
 - Sanitize inputs in untrusted environments
 - Only convert files from trusted sources
 - The `[pdf,docx,pptx,xlsx]` extras are installed; audio transcription and Azure AI are NOT installed
 - LLM image analysis requires an OpenRouter API key and incurs token costs for each image
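One way to approach input sanitization is to validate paths before handing them to markitdown. The sketch below is an illustration under stated assumptions, not something markitdown provides: `safe_input_path` is a hypothetical helper, the allow-list of suffixes is an assumption, and `Path.is_relative_to` needs Python 3.9 or later.

```python
from pathlib import Path

# Assumption: only these image types should reach the vision model.
ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png"}


def safe_input_path(candidate, base_dir):
    """Resolve candidate under base_dir and reject anything unexpected.

    Raises ValueError if the path escapes base_dir (via "..", an absolute
    path, or a symlink), has a suffix outside the allow-list, or is not
    an existing regular file.
    """
    base = Path(base_dir).resolve()
    resolved = (base / candidate).resolve()  # resolve() also follows symlinks
    if not resolved.is_relative_to(base):
        raise ValueError(f"path escapes {base}: {candidate!r}")
    if resolved.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file type: {resolved.suffix!r}")
    if not resolved.is_file():
        raise ValueError(f"not a regular file: {candidate!r}")
    return resolved
```

A caller would pass the returned path to `md.convert(...)` only after this check succeeds. Checking the path does not make the file's contents trustworthy, so the notes above still apply.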