skill: add image analysis with Qwen 2.5 VL via OpenRouter

2026-06-08 11:37:31 +10:00
parent 1619206398
commit d3ce7f12de


@@ -31,7 +31,7 @@ Then recreate the wrapper at `~/.local/bin/markitdown`.
| Word | `.docx` | lxml, mammoth |
| PowerPoint | `.pptx` | python-pptx |
| Excel | `.xlsx`, `.xls` | openpyxl, pandas, xlrd |
-| Images | `.jpg`, `.png`, etc. | EXIF metadata (core); OCR needs separate `markitdown-ocr` plugin |
+| Images | `.jpg`, `.png`, etc. | EXIF metadata (core); LLM vision via `llm_client`/`llm_model`; OCR via `markitdown-ocr` plugin (installed) |
| HTML | `.html`, `.htm` | beautifulsoup4 (core) |
| CSV | `.csv` | (core) |
| JSON | `.json` | (core) |
@@ -79,9 +79,64 @@ This is especially useful for:
- Images with EXIF metadata
- Any file format not directly readable by the `read` tool
## Image Analysis (LLM Vision)
For images, markitdown can extract EXIF metadata (free, no API key needed) and can also describe the image content using an LLM vision model.
**EXIF only (already works):**
```bash
markitdown photo.jpg
```
**With LLM vision (requires an OpenRouter API key):**
Set the environment variable:
```bash
export OPENROUTER_API_KEY=sk-or-v1-...
```
Then use the wrapper:
```bash
markitdown-vision photo.jpg
```
Or use the Python API directly:
```python
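import os

from markitdown import MarkItDown
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible endpoint, so the stock OpenAI client works.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Passing a client and a model name enables LLM-generated image descriptions.
md = MarkItDown(
    llm_client=client,
    llm_model="qwen/qwen2.5-vl-72b-instruct",
)

result = md.convert("photo.jpg")
print(result.text_content)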
```
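The snippet above raises `KeyError` when `OPENROUTER_API_KEY` is unset. A minimal sketch of a fallback follows, assuming EXIF-only output is acceptable when no key is available. The helper names `markitdown_kwargs` and `describe` are hypothetical and are not part of markitdown or of the `markitdown-vision` wrapper, whose contents this commit does not show.

```python
import os

OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"
VISION_MODEL = "qwen/qwen2.5-vl-72b-instruct"


def markitdown_kwargs(env):
    """Build MarkItDown keyword arguments from an environment mapping.

    Hypothetical helper: returns an empty dict when no API key is set, which
    leaves MarkItDown in its default mode (EXIF metadata only for images).
    """
    api_key = env.get("OPENROUTER_API_KEY", "").strip()
    if not api_key:
        return {}
    # Imported lazily so the EXIF-only path does not need the openai package.
    from openai import OpenAI

    client = OpenAI(base_url=OPENROUTER_BASE_URL, api_key=api_key)
    return {"llm_client": client, "llm_model": VISION_MODEL}


def describe(path, env=os.environ):
    """Convert `path`, using LLM vision only if a key is configured."""
    from markitdown import MarkItDown

    md = MarkItDown(**markitdown_kwargs(env))
    return md.convert(path).text_content
```

Calling `describe("photo.jpg")` would then return the EXIF-only Markdown when the key is missing instead of failing, and the LLM description when it is present.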
**Why Qwen 2.5 VL 72B?**
- Excellent vision understanding
- Affordable: ~$0.25/M input, ~$0.75/M output tokens
- 131K context window
- Available on OpenRouter
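To make the pricing concrete, here is a rough cost estimate based on the approximate rates listed above. The per-image token counts are hypothetical placeholders, not measurements; actual usage depends on image size and the length of the generated description.

```python
# Approximate rates quoted above, in USD per million tokens.
INPUT_USD_PER_MILLION = 0.25
OUTPUT_USD_PER_MILLION = 0.75


def estimate_cost_usd(input_tokens, output_tokens):
    """Estimate the cost in USD of a request from its token counts."""
    total = (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    )
    return total / 1_000_000


# Hypothetical per-image usage: 1,500 input tokens and 300 output tokens.
per_image = estimate_cost_usd(1_500, 300)
print(f"per image: ${per_image:.4f}, per 1,000 images: ${per_image * 1_000:.2f}")
```

Under those assumed token counts, one image costs roughly $0.0006, so a batch of 1,000 images would come to about $0.60.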
**OCR inside documents (installed):**
The `markitdown-ocr` plugin is installed. Enable it with:
```python
# `client` is the OpenRouter-backed OpenAI client created in the Python API example above.
md = MarkItDown(enable_plugins=True, llm_client=client, llm_model="qwen/qwen2.5-vl-72b-instruct")
```
This extracts text from images embedded in PDF, Word, PowerPoint, and Excel files.
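Because each embedded image sent to the model costs tokens, one option is to enable plugins only for the document types named above. The following is a sketch under stated assumptions: the extension set is inferred from those format names, the plugin's actual coverage (for example legacy `.xls`) is not confirmed here, and both function names are hypothetical.

```python
from pathlib import Path

# Assumption: extensions inferred from the formats listed above, not taken
# from the plugin's documentation.
EMBEDDED_IMAGE_FORMATS = {".pdf", ".docx", ".pptx", ".xlsx"}


def may_contain_embedded_images(path):
    """Return True if the file type can carry embedded images worth OCR-ing."""
    return Path(path).suffix.lower() in EMBEDDED_IMAGE_FORMATS


def convert_with_optional_ocr(path, client, model="qwen/qwen2.5-vl-72b-instruct"):
    """Convert `path`, turning plugins on only for document formats."""
    from markitdown import MarkItDown  # imported lazily; requires markitdown

    md = MarkItDown(
        enable_plugins=may_contain_embedded_images(path),
        llm_client=client,
        llm_model=model,
    )
    return md.convert(path).text_content
```

Standalone images still go through the LLM vision path because `llm_client` and `llm_model` are always passed; only the plugin loading is conditional.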
## Security Notes
- MarkItDown performs I/O with current process privileges
- Sanitize inputs in untrusted environments
- Only convert files from trusted sources
- The `[pdf,docx,pptx,xlsx]` extras are installed; audio transcription and Azure AI are NOT installed
- Image analysis requires an OpenRouter API key and incurs token costs for each image