skill: add image analysis with Qwen 2.5 VL via OpenRouter
@@ -31,7 +31,7 @@ Then recreate the wrapper at `~/.local/bin/markitdown`.
| Word | `.docx` | lxml, mammoth |
| PowerPoint | `.pptx` | python-pptx |
| Excel | `.xlsx`, `.xls` | openpyxl, pandas, xlrd |
| Images | `.jpg`, `.png`, etc. | EXIF metadata (core); OCR needs separate `markitdown-ocr` plugin |
| Images | `.jpg`, `.png`, etc. | EXIF metadata (core); LLM vision via `llm_client`/`llm_model`; OCR via `markitdown-ocr` plugin (installed) |
| HTML | `.html`, `.htm` | beautifulsoup4 (core) |
| CSV | `.csv` | (core) |
| JSON | `.json` | (core) |
@@ -79,9 +79,64 @@ This is especially useful for:
- Images with EXIF metadata
- Any file format not directly readable by the `read` tool

## Image Analysis (LLM Vision)

For images, markitdown can extract EXIF metadata (free, no API key needed) and describe the image content using an LLM vision model.

**EXIF only (already works):**
```bash
markitdown photo.jpg
```

**With LLM vision (requires an OpenRouter API key):**

Set the environment variable:
```bash
export OPENROUTER_API_KEY=sk-or-v1-...
```

Then use the wrapper:
```bash
markitdown-vision photo.jpg
```
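
The contents of the `markitdown-vision` wrapper are not shown here. As a rough idea of what such a wrapper could look like, the following is a minimal sketch, not the installed script: the helper names (`get_api_key`, `parse_args`, `main`) and the `--model` flag are assumptions made for illustration, while the base URL, model ID, and `MarkItDown` arguments come from this document.

```python
"""Hypothetical sketch of a markitdown-vision style wrapper (not the installed script)."""
import argparse
import os
import sys

OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"
DEFAULT_MODEL = "qwen/qwen2.5-vl-72b-instruct"


def get_api_key(env=None):
    """Return the OpenRouter key from the environment, or exit with a clear message."""
    env = os.environ if env is None else env
    key = env.get("OPENROUTER_API_KEY", "").strip()
    if not key:
        raise SystemExit("OPENROUTER_API_KEY is not set; export it before using LLM vision.")
    return key


def parse_args(argv):
    """Parse the image path and an optional model override (the flag is an assumption)."""
    parser = argparse.ArgumentParser(
        description="Convert an image to Markdown with an LLM description."
    )
    parser.add_argument("path", help="image file to convert")
    parser.add_argument("--model", default=DEFAULT_MODEL, help="OpenRouter model ID")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(sys.argv[1:] if argv is None else argv)
    api_key = get_api_key()
    # Imported lazily so argument and key errors surface before the heavy imports.
    from markitdown import MarkItDown
    from openai import OpenAI

    client = OpenAI(base_url=OPENROUTER_BASE_URL, api_key=api_key)
    md = MarkItDown(llm_client=client, llm_model=args.model)
    print(md.convert(args.path).text_content)

# A real script would end by calling main() under an
# `if __name__ == "__main__":` guard.
```

Failing early on a missing key keeps the error message readable instead of surfacing as a `KeyError` or an authentication failure from the API.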

Or use the Python API directly:
```python
from markitdown import MarkItDown
from openai import OpenAI
import os

# OpenRouter exposes an OpenAI-compatible API, so the OpenAI client is pointed at it.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Passing llm_client and llm_model enables LLM-generated image descriptions.
md = MarkItDown(
    llm_client=client,
    llm_model="qwen/qwen2.5-vl-72b-instruct",
)

result = md.convert("photo.jpg")
print(result.text_content)
```

**Why Qwen 2.5 VL 72B?**
- Excellent vision understanding
- Affordable: ~$0.25/M input, ~$0.75/M output tokens
- 131K context window
- Available on OpenRouter
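
To get a feel for what the approximate prices above mean per image, here is a small back-of-the-envelope sketch. The per-image token counts (1,500 input, 300 output) are assumptions chosen for illustration; real counts depend on image size and prompt.

```python
# Approximate prices from the list above, in USD per million tokens.
INPUT_USD_PER_M = 0.25
OUTPUT_USD_PER_M = 0.75


def estimate_cost_usd(input_tokens, output_tokens):
    """Estimate the cost of one request from its token counts."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


# Hypothetical per-image token counts; check the usage reported for your own requests.
per_image = estimate_cost_usd(input_tokens=1_500, output_tokens=300)
print(f"~${per_image:.6f} per image, ~${per_image * 1000:.2f} per 1,000 images")
```

Under these assumed counts the estimate works out to about $0.0006 per image, or roughly $0.60 per 1,000 images.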

**OCR inside documents (installed):**
The `markitdown-ocr` plugin is installed. Enable it with:
```python
md = MarkItDown(enable_plugins=True, llm_client=client, llm_model="qwen/qwen2.5-vl-72b-instruct")
```
This extracts text from images embedded in PDF, Word, PowerPoint, and Excel files.

## Security Notes

- MarkItDown performs I/O with the current process's privileges
- Sanitize inputs in untrusted environments
- Only convert files from trusted sources
- The `[pdf,docx,pptx,xlsx]` extras are installed; audio transcription and Azure AI are NOT installed
- Image analysis requires an OpenRouter API key and costs tokens for each image
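
One way to act on the "sanitize inputs" advice above is to validate a path before handing it to MarkItDown. The following is a minimal sketch, not part of MarkItDown: `validate_input`, the `ALLOWED_SUFFIXES` allow-list, and the idea of a single allowed root directory are all assumptions for illustration. It relies on `Path.is_relative_to`, which requires Python 3.9 or later.

```python
from pathlib import Path

# Hypothetical allow-list; adjust it to the formats you actually convert.
ALLOWED_SUFFIXES = {".pdf", ".docx", ".pptx", ".xlsx", ".jpg", ".jpeg", ".png"}


def validate_input(path, allowed_root):
    """Return the resolved path if it is a regular file under allowed_root
    with an allowed extension; raise ValueError otherwise."""
    root = Path(allowed_root).resolve()
    resolved = Path(path).resolve()  # follows symlinks and collapses ".."
    if not resolved.is_relative_to(root):
        raise ValueError(f"{resolved} is outside {root}")
    if resolved.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file type: {resolved.suffix!r}")
    if not resolved.is_file():
        raise ValueError(f"not a regular file: {resolved}")
    return resolved
```

A caller could then pass `validate_input(user_path, upload_dir)` to `md.convert(...)`, where `user_path` and `upload_dir` are placeholders for your own values. This narrows which files get read; it does not make untrusted file contents safe to parse.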