---
name: markitdown
description: Convert various file formats to Markdown for use with LLMs and text analysis. Supports PDF, Word, Excel, PowerPoint, images, HTML, CSV, JSON, XML, ZIP, EPubs, and YouTube URLs.
---

# MarkItDown

Convert files to Markdown for LLM consumption and text analysis. A lightweight Python utility by Microsoft.

## Installation

MarkItDown is installed in a Python venv at `/tmp/markitdown-env/` with a wrapper at `~/.local/bin/markitdown`. The wrapper sets `LD_LIBRARY_PATH` so that numpy's C extensions load on NixOS.

If the venv is missing (e.g., after a rebuild), recreate it:

```bash
nix-shell -p python3 python3.pkgs.pip python3.pkgs.virtualenv gcc stdenv.cc.cc.lib --run "
  python3 -m venv /tmp/markitdown-env
  source /tmp/markitdown-env/bin/activate
  pip install 'markitdown[pdf,docx,pptx,xlsx]'
"
```

Then recreate the wrapper at `~/.local/bin/markitdown`.

## Supported Formats

| Format | Extension | Dependencies |
|--------|-----------|-------------|
| PDF | `.pdf` | pdfminer-six, pdfplumber |
| Word | `.docx` | lxml, mammoth |
| PowerPoint | `.pptx` | python-pptx |
| Excel | `.xlsx`, `.xls` | openpyxl, pandas, xlrd |
| Images | `.jpg`, `.png`, etc. | EXIF metadata (core); LLM vision via `llm_client`/`llm_model`; OCR via `markitdown-ocr` plugin (installed) |
| HTML | `.html`, `.htm` | beautifulsoup4 (core) |
| CSV | `.csv` | (core) |
| JSON | `.json` | (core) |
| XML | `.xml` | (core) |
| ZIP | `.zip` | (core, iterates contents) |
| EPubs | `.epub` | (core) |
| YouTube | URLs | youtube-transcript-api (core) |
| Text | `.txt`, `.md`, etc. | (core) |

## CLI Usage

```bash
# Convert a file to Markdown (stdout)
markitdown path/to/file.pdf

# Write to file
markitdown path/to/file.pdf -o output.md

# Pipe content
cat file.pdf | markitdown
```

## Python API

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```

## Integration with Pi

Use `markitdown` to convert files before reading them with the `read` tool:

```bash
# Convert first, then open /tmp/report.md with the `read` tool
# (`read` here is Pi's tool, not the shell builtin)
markitdown report.pdf -o /tmp/report.md
```

This is especially useful for:

- PDFs that need structure preserved (headings, lists, tables)
- Office documents (Word, Excel, PowerPoint)
- Images with EXIF metadata
- Any file format not directly readable by the `read` tool

## Image Analysis (LLM Vision)

For images, markitdown can extract EXIF metadata (free, no API key) and can also describe image content using an LLM vision model.

**EXIF only (already works):**

```bash
markitdown photo.jpg
```

**With LLM vision (requires an OpenRouter API key):**

Set the environment variable:

```bash
export OPENROUTER_API_KEY=sk-or-v1-...
```

Then use the wrapper:

```bash
markitdown-vision photo.jpg
```

Or use the Python API directly:

```python
from markitdown import MarkItDown
from openai import OpenAI
import os

# OpenRouter exposes an OpenAI-compatible endpoint
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

md = MarkItDown(
    llm_client=client,
    llm_model="qwen/qwen2.5-vl-72b-instruct",
)
result = md.convert("photo.jpg")
print(result.text_content)
```

**Why Qwen 2.5 VL 72B?**

- Excellent vision understanding
- Affordable: ~$0.25/M input, ~$0.75/M output tokens
- 131K context window
- Available on OpenRouter

**OCR inside documents (installed):**

The `markitdown-ocr` plugin is installed. Enable it with:

```python
# `client` is the OpenAI client created in the previous example
md = MarkItDown(enable_plugins=True, llm_client=client, llm_model="qwen/qwen2.5-vl-72b-instruct")
```

This extracts text from images embedded in PDFs, Word, PowerPoint, and Excel files.

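
The two modes described above (EXIF only, or LLM vision with OCR) can be chosen at runtime depending on whether the API key is set. The following is a minimal sketch under that assumption, not a definitive implementation. `converter_kwargs` and `convert_file` are hypothetical helper names and are not part of MarkItDown. The sketch also assumes the `openai` package is available in the same venv.

```python
import os

VISION_MODEL = "qwen/qwen2.5-vl-72b-instruct"


def converter_kwargs(env=os.environ):
    """Hypothetical helper: build keyword arguments for MarkItDown().

    Returns an empty dict when OPENROUTER_API_KEY is unset or empty, so the
    conversion stays EXIF-only and spends no tokens.
    """
    api_key = env.get("OPENROUTER_API_KEY")
    if not api_key:
        return {}

    # Imported lazily so the EXIF-only path does not need the openai package.
    from openai import OpenAI

    client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=api_key)
    return {
        "llm_client": client,
        "llm_model": VISION_MODEL,
        "enable_plugins": True,  # turns on the markitdown-ocr plugin
    }


def convert_file(path, env=os.environ):
    """Hypothetical helper: convert one file and return its Markdown text."""
    from markitdown import MarkItDown

    md = MarkItDown(**converter_kwargs(env))
    return md.convert(path).text_content
```

Calling `convert_file("photo.jpg")` with the key unset behaves like plain `markitdown photo.jpg`. With the key set, the same call sends each image to the vision model.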
## Security Notes

- MarkItDown performs I/O with current process privileges
- Sanitize inputs in untrusted environments
- Only convert files from trusted sources
- The `[pdf,docx,pptx,xlsx]` extras are installed; audio transcription and Azure AI are NOT installed
- Image analysis requires an OpenRouter API key and costs tokens per image
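
One way to apply the "sanitize inputs" advice is to check a path before handing it to MarkItDown. The sketch below is an illustration under stated assumptions, not a complete defence. `checked_path` and `ALLOWED_SUFFIXES` are hypothetical names. The allowlist is drawn from the Supported Formats table, and the trusted directory is whatever the caller chooses.

```python
from pathlib import Path

# Assumed allowlist, drawn from the Supported Formats table.
# ZIP is deliberately left out here: its contents would bypass the suffix check.
ALLOWED_SUFFIXES = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".xls", ".jpg", ".png",
    ".html", ".htm", ".csv", ".json", ".xml", ".epub", ".txt", ".md",
}


def checked_path(candidate, trusted_root):
    """Hypothetical helper: return the resolved path, or raise ValueError.

    The path must resolve to a regular file inside trusted_root and must
    carry an allowed suffix. Symlinks and ".." segments are resolved first.
    """
    root = Path(trusted_root).resolve()
    path = Path(candidate).resolve()

    if not path.is_relative_to(root):  # requires Python 3.9+
        raise ValueError(f"{path} is outside {root}")
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file type: {path.suffix!r}")
    if not path.is_file():
        raise ValueError(f"{path} is not a regular file")
    return path
```

A caller would then pass the result on, for example `md.convert(str(checked_path(user_path, trusted_dir)))`, where `user_path` and `trusted_dir` are placeholders for the caller's own values.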