| name | description |
|---|---|
| markitdown | Convert various file formats to Markdown for use with LLMs and text analysis. Supports PDF, Word, Excel, PowerPoint, images, HTML, CSV, JSON, XML, ZIP, EPubs, and YouTube URLs. |
MarkItDown
Convert files to Markdown for LLM consumption and text analysis. A lightweight Python utility by Microsoft.
Installation
Installed in a Python venv at /tmp/markitdown-env/ with a wrapper at ~/.local/bin/markitdown.
The wrapper handles LD_LIBRARY_PATH for numpy's C extensions on NixOS.
If the venv is missing (e.g., after a rebuild), recreate it:
nix-shell -p python3 python3.pkgs.pip python3.pkgs.virtualenv gcc stdenv.cc.cc.lib --run "
python3 -m venv /tmp/markitdown-env
source /tmp/markitdown-env/bin/activate
pip install 'markitdown[pdf,docx,pptx,xlsx]'
"
Then recreate the wrapper at ~/.local/bin/markitdown.
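Before rebuilding anything, it can help to check which of the two pieces is actually missing. The following is a minimal sketch using the two paths named above; the helper name `missing_parts` is hypothetical, and the `bin/python` check assumes the standard Linux venv layout.

```python
from pathlib import Path

# Paths taken from the Installation notes above.
VENV_DIR = Path("/tmp/markitdown-env")
WRAPPER = Path.home() / ".local" / "bin" / "markitdown"


def missing_parts(venv_dir: Path = VENV_DIR, wrapper: Path = WRAPPER) -> list:
    """Return the names of the install pieces that are absent."""
    missing = []
    # Assumption: a venv made by `python3 -m venv` has bin/python on Linux.
    if not (venv_dir / "bin" / "python").exists():
        missing.append("venv")
    if not wrapper.is_file():
        missing.append("wrapper")
    return missing


if __name__ == "__main__":
    print(missing_parts() or "markitdown install looks intact")
```

An empty list means both the venv and the wrapper are in place; otherwise only the listed pieces need to be recreated.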
Supported Formats
| Format | Extension | Dependencies |
|---|---|---|
| PDF | .pdf | pdfminer-six, pdfplumber |
| Word | .docx | lxml, mammoth |
| PowerPoint | .pptx | python-pptx |
| Excel | .xlsx, .xls | openpyxl, pandas, xlrd |
| Images | .jpg, .png, etc. | EXIF metadata (core); OCR needs separate markitdown-ocr plugin |
| HTML | .html, .htm | beautifulsoup4 (core) |
| CSV | .csv | (core) |
| JSON | .json | (core) |
| XML | .xml | (core) |
| ZIP | .zip | (core, iterates contents) |
| EPubs | .epub | (core) |
| YouTube | URLs | youtube-transcript-api (core) |
| Text | .txt, .md, etc. | (core) |
CLI Usage
# Convert a file to Markdown (stdout)
markitdown path/to/file.pdf
# Write to file
markitdown path/to/file.pdf -o output.md
# Pipe content
cat file.pdf | markitdown
Python API
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
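To save the result instead of printing it, the same two calls can be wrapped in a small helper. This is a sketch, not part of MarkItDown: `convert_to_markdown` and its `converter` parameter are hypothetical names, and only `MarkItDown().convert()` and `result.text_content` come from the example above.

```python
from pathlib import Path


def convert_to_markdown(src, dest, converter=None):
    """Convert src to Markdown, write it to dest, return the character count."""
    if converter is None:
        # Imported lazily so the helper can be loaded outside the venv.
        from markitdown import MarkItDown
        converter = MarkItDown()
    result = converter.convert(str(src))
    dest = Path(dest)
    # Create the output directory if it does not exist yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(result.text_content, encoding="utf-8")
    return len(result.text_content)
```

Calling `convert_to_markdown("document.pdf", "/tmp/document.md")` would mirror `markitdown document.pdf -o /tmp/document.md` from the CLI section. Passing a `converter` lets one `MarkItDown` instance be reused across many files.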
Integration with Pi
Use markitdown to convert files before reading them with the read tool:
# Convert then read
markitdown report.pdf -o /tmp/report.md && read /tmp/report.md
This is especially useful for:
- PDFs that need structure preserved (headings, lists, tables)
- Office documents (Word, Excel, PowerPoint)
- Images with EXIF metadata
- Any file format not directly readable by the read tool
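The decision of whether to convert first can be sketched as a lookup on the file extension. The extension set below is an assumption: it takes the binary formats from the Supported Formats table and leaves out text-based ones, on the guess that the read tool handles those directly. The name `needs_conversion` is hypothetical.

```python
from pathlib import Path

# Assumption: binary formats from the Supported Formats table need conversion;
# text formats (.txt, .md, .csv, .json, .xml, .html) are read directly.
CONVERT_FIRST = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".xls",
    ".jpg", ".png", ".zip", ".epub",
}


def needs_conversion(path) -> bool:
    """Return True if the file should go through markitdown before reading."""
    return Path(path).suffix.lower() in CONVERT_FIRST
```

The comparison is case-insensitive, so `Report.PDF` is treated the same as `report.pdf`. Adjust the set to match what the read tool can actually open.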
Security Notes
- MarkItDown performs I/O with current process privileges
- Sanitize inputs in untrusted environments
- Only convert files from trusted sources
- The [pdf,docx,pptx,xlsx] extras are installed; audio transcription and Azure AI are NOT installed
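One way to sanitize inputs is to accept only files that resolve to a location inside a directory you control. The following is a minimal sketch of that idea, not a complete defense; `safe_input_path` and `allowed_root` are hypothetical names, and it requires Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path


def safe_input_path(user_path, allowed_root):
    """Resolve user_path and reject it unless it is a file under allowed_root."""
    root = Path(allowed_root).resolve()
    # resolve() collapses ".." and follows symlinks, so the check below
    # sees the real target rather than the string the caller supplied.
    candidate = (root / user_path).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"{user_path!r} escapes {root}")
    if not candidate.is_file():
        raise FileNotFoundError(candidate)
    return candidate
```

A caller would pass the returned path to `markitdown` only if no exception is raised. Because the argument is always treated as a local path, a URL string ends in an error instead of a network fetch.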