| name | description |
|---|---|
| markitdown | Convert various file formats to Markdown for use with LLMs and text analysis. Supports PDF, Word, Excel, PowerPoint, images, HTML, CSV, JSON, XML, ZIP, EPubs, and YouTube URLs. |
MarkItDown
Convert files to Markdown for LLM consumption and text analysis. A lightweight Python utility by Microsoft.
Installation
Installed in a Python venv at /tmp/markitdown-env/ with a wrapper at ~/.local/bin/markitdown.
The wrapper handles LD_LIBRARY_PATH for numpy's C extensions on NixOS.
If the venv is missing (e.g., after rebuild), recreate:
nix-shell -p python3 python3.pkgs.pip python3.pkgs.virtualenv gcc stdenv.cc.cc.lib --run "
python3 -m venv /tmp/markitdown-env
source /tmp/markitdown-env/bin/activate
pip install 'markitdown[pdf,docx,pptx,xlsx]'
"
Then recreate the wrapper at ~/.local/bin/markitdown.
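To tell whether the venv or the wrapper needs recreating, it can help to check both paths first. The sketch below is a minimal illustration: the function name `missing_components` is hypothetical, and it only checks that the files exist, not that the install actually works.

```python
from pathlib import Path

# Paths taken from the installation notes above.
VENV_DIR = Path("/tmp/markitdown-env")
WRAPPER = Path.home() / ".local" / "bin" / "markitdown"


def missing_components(venv_dir: Path = VENV_DIR, wrapper: Path = WRAPPER) -> list:
    """Return the names of the install pieces that are absent on disk."""
    missing = []
    # A venv created with `python3 -m venv` contains bin/python on Linux.
    if not (venv_dir / "bin" / "python").exists():
        missing.append("venv")
    if not wrapper.is_file():
        missing.append("wrapper")
    return missing


if __name__ == "__main__":
    print(missing_components() or "markitdown install looks complete")
```

An empty list means both pieces are present; otherwise the list names what to recreate.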
Supported Formats
| Format | Extension | Dependencies |
|---|---|---|
| PDF | .pdf | pdfminer-six, pdfplumber |
| Word | .docx | lxml, mammoth |
| PowerPoint | .pptx | python-pptx |
| Excel | .xlsx, .xls | openpyxl, pandas, xlrd |
| Images | .jpg, .png, etc. | EXIF metadata (core); LLM vision via llm_client/llm_model; OCR via markitdown-ocr plugin (installed) |
| HTML | .html, .htm | beautifulsoup4 (core) |
| CSV | .csv | (core) |
| JSON | .json | (core) |
| XML | .xml | (core) |
| ZIP | .zip | (core, iterates contents) |
| EPubs | .epub | (core) |
| YouTube | URLs | youtube-transcript-api (core) |
| Text | .txt, .md, etc. | (core) |
CLI Usage
# Convert a file to Markdown (stdout)
markitdown path/to/file.pdf
# Write to file
markitdown path/to/file.pdf -o output.md
# Pipe content
cat file.pdf | markitdown
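The same CLI calls can be driven from a script. The following is a minimal sketch, assuming the `markitdown` wrapper is on `PATH`; the helper names `build_command` and `convert_with_cli` are hypothetical, and only the positional file argument and the `-o` flag shown above are used.

```python
import subprocess
from typing import List, Optional


def build_command(src: str, out: Optional[str] = None, exe: str = "markitdown") -> List[str]:
    """Build the argv list; `-o` is added only when an output path is given."""
    cmd = [exe, src]
    if out is not None:
        cmd += ["-o", out]
    return cmd


def convert_with_cli(src: str, out: Optional[str] = None) -> str:
    """Run the CLI and return whatever it printed to stdout."""
    done = subprocess.run(
        build_command(src, out),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return done.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.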
Python API
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
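The single-file call above extends naturally to a folder. This is a sketch under stated assumptions: `convert_folder` and `markitdown_converter` are hypothetical helper names, the suffix list mirrors the installed extras, and the converter is passed in as a plain function so the loop can be tried without MarkItDown installed.

```python
from pathlib import Path
from typing import Callable, List


def markitdown_converter() -> Callable[[str], str]:
    """Return a path -> Markdown function backed by MarkItDown (imported lazily)."""
    from markitdown import MarkItDown

    md = MarkItDown()
    return lambda path: md.convert(path).text_content


def convert_folder(
    src_dir,
    out_dir,
    convert: Callable[[str], str],
    suffixes=(".pdf", ".docx", ".pptx", ".xlsx"),
) -> List[Path]:
    """Convert every matching file in src_dir and write <name>.md into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(Path(src_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in suffixes:
            # Keep the original extension in the name so report.pdf and
            # report.docx do not overwrite each other.
            target = out_dir / (path.name + ".md")
            target.write_text(convert(str(path)), encoding="utf-8")
            written.append(target)
    return written
```

A typical call would be `convert_folder("docs", "/tmp/docs-md", markitdown_converter())`.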
Integration with Pi
Use markitdown to convert files before reading them with the read tool:
# Convert, then open /tmp/report.md with the read tool
markitdown report.pdf -o /tmp/report.md
This is especially useful for:
- PDFs that need structure preserved (headings, lists, tables)
- Office documents (Word, Excel, PowerPoint)
- Images with EXIF metadata
- Any file format not directly readable by the read tool
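The decision of when to convert first can be sketched as a simple extension check. The function name `needs_conversion` is hypothetical, and the split between formats is an assumption: it treats the binary and container formats from the Supported Formats table as needing conversion and assumes the read tool handles plain-text formats directly.

```python
from pathlib import Path

# Binary or container formats from the Supported Formats table.
# ASSUMPTION: plain-text formats (.txt, .md, .csv, .json, .xml, .html)
# are readable by the read tool without conversion.
CONVERT_FIRST = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".xls",
    ".epub", ".zip", ".jpg", ".jpeg", ".png",
}


def needs_conversion(path: str) -> bool:
    """Return True when the file should go through markitdown before reading."""
    return Path(path).suffix.lower() in CONVERT_FIRST
```

The check is case-insensitive, so `REPORT.PDF` and `report.pdf` are treated the same.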
Image Analysis (LLM Vision)
For images, markitdown can extract EXIF metadata (free, no API key) AND describe image content using an LLM vision model.
EXIF only (already works):
markitdown photo.jpg
With LLM vision — requires OpenRouter API key:
Set environment variable:
export OPENROUTER_API_KEY=sk-or-v1-...
Then use the wrapper:
markitdown-vision photo.jpg
Or use Python API directly:
from markitdown import MarkItDown
from openai import OpenAI
import os
client = OpenAI(
base_url="https://openrouter.ai/api/v1",
api_key=os.environ["OPENROUTER_API_KEY"],
)
md = MarkItDown(
llm_client=client,
llm_model="qwen/qwen2.5-vl-72b-instruct",
)
result = md.convert("photo.jpg")
print(result.text_content)
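Because `os.environ["OPENROUTER_API_KEY"]` raises `KeyError` when the variable is unset, a script may want to fall back to EXIF-only conversion instead. The following is a sketch of that idea; `vision_settings` and `make_converter` are hypothetical names, and the base URL and model are the ones used in the example above.

```python
import os
from typing import Mapping, Optional

OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"
VISION_MODEL = "qwen/qwen2.5-vl-72b-instruct"


def vision_settings(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return OpenRouter settings, or None when no API key is set."""
    key = env.get("OPENROUTER_API_KEY", "").strip()
    if not key:
        return None
    return {"base_url": OPENROUTER_BASE_URL, "api_key": key, "model": VISION_MODEL}


def make_converter(env: Mapping[str, str] = os.environ):
    """Build a MarkItDown instance, enabling LLM vision only when a key exists."""
    from markitdown import MarkItDown  # lazy import keeps vision_settings dependency-free

    settings = vision_settings(env)
    if settings is None:
        return MarkItDown()  # EXIF metadata only, no API calls
    from openai import OpenAI

    client = OpenAI(base_url=settings["base_url"], api_key=settings["api_key"])
    return MarkItDown(llm_client=client, llm_model=settings["model"])
```

With this shape, `make_converter().convert("photo.jpg")` works in both cases and only incurs token costs when the key is present.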
Why Qwen 2.5 VL 72B?
- Excellent vision understanding
- Affordable: ~$0.25/M input, ~$0.75/M output tokens
- 131K context window
- Available on OpenRouter
OCR inside documents (installed):
markitdown-ocr plugin is installed. Enable with:
md = MarkItDown(enable_plugins=True, llm_client=client, llm_model="qwen/qwen2.5-vl-72b-instruct")
This extracts text from images embedded in PDFs, Word, PowerPoint, and Excel files.
Security Notes
- MarkItDown performs I/O with current process privileges
- Sanitize inputs in untrusted environments
- Only convert files from trusted sources
- The [pdf,docx,pptx,xlsx] extras are installed; audio transcription and Azure AI are NOT installed
- Image analysis requires an OpenRouter API key and costs tokens per image
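One way to sanitize inputs before conversion is to confine paths to a known directory and an allow-list of file types. This is a minimal sketch, not part of MarkItDown: `safe_input_path` is a hypothetical helper, and the allow-list shown is an assumption that should be adjusted to the formats actually needed.

```python
from pathlib import Path

# ASSUMPTION: only these types are accepted from untrusted callers.
ALLOWED_SUFFIXES = {".pdf", ".docx", ".pptx", ".xlsx"}


def safe_input_path(user_path: str, base_dir: str) -> Path:
    """Resolve user_path under base_dir, rejecting escapes and unexpected types."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    # resolve() collapses ".." and follows symlinks, so a traversal attempt
    # or an absolute path ends up outside base and is rejected here.
    if base not in candidate.parents:
        raise ValueError(f"{user_path!r} is outside {base}")
    if candidate.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file type: {candidate.suffix!r}")
    return candidate
```

The returned path can then be passed to `md.convert(str(path))`; anything that raises `ValueError` never reaches the converter.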