From 161920639851eca17e602c35d3610fd101b960b2 Mon Sep 17 00:00:00 2001
From: Sam Rolfe
Date: Sun, 7 Jun 2026 17:26:02 +1000
Subject: [PATCH] skill: add markitdown file converter

---
 skills/markitdown/SKILL.md | 87 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)
 create mode 100644 skills/markitdown/SKILL.md

diff --git a/skills/markitdown/SKILL.md b/skills/markitdown/SKILL.md
new file mode 100644
index 0000000..91304ea
--- /dev/null
+++ b/skills/markitdown/SKILL.md
@@ -0,0 +1,87 @@
+---
+name: markitdown
+description: Convert various file formats to Markdown for use with LLMs and text analysis. Supports PDF, Word, Excel, PowerPoint, images, HTML, CSV, JSON, XML, ZIP, EPubs, and YouTube URLs.
+---
+
+# MarkItDown
+
+Convert files to Markdown for LLM consumption and text analysis. A lightweight Python utility by Microsoft.
+
+## Installation
+
+Installed in a Python venv at `/tmp/markitdown-env/` with a wrapper at `~/.local/bin/markitdown`.
+
+The wrapper handles `LD_LIBRARY_PATH` for numpy's C extensions on NixOS.
+
+If the venv is missing (e.g., after rebuild), recreate:
+```bash
+nix-shell -p python3 python3.pkgs.pip python3.pkgs.virtualenv gcc stdenv.cc.cc.lib --run "
+  python3 -m venv /tmp/markitdown-env
+  source /tmp/markitdown-env/bin/activate
+  pip install 'markitdown[pdf,docx,pptx,xlsx]'
+"
+```
+Then recreate the wrapper at `~/.local/bin/markitdown`.
+
+## Supported Formats
+
+| Format | Extension | Dependencies |
+|--------|-----------|-------------|
+| PDF | `.pdf` | pdfminer-six, pdfplumber |
+| Word | `.docx` | lxml, mammoth |
+| PowerPoint | `.pptx` | python-pptx |
+| Excel | `.xlsx`, `.xls` | openpyxl, pandas, xlrd |
+| Images | `.jpg`, `.png`, etc. | EXIF metadata (core); OCR needs separate `markitdown-ocr` plugin |
+| HTML | `.html`, `.htm` | beautifulsoup4 (core) |
+| CSV | `.csv` | (core) |
+| JSON | `.json` | (core) |
+| XML | `.xml` | (core) |
+| ZIP | `.zip` | (core, iterates contents) |
+| EPubs | `.epub` | (core) |
+| YouTube | URLs | youtube-transcript-api (core) |
+| Text | `.txt`, `.md`, etc. | (core) |
+
+## CLI Usage
+
+```bash
+# Convert a file to Markdown (stdout)
+markitdown path/to/file.pdf
+
+# Write to file
+markitdown path/to/file.pdf -o output.md
+
+# Pipe content
+cat file.pdf | markitdown
+```
+
+## Python API
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("document.pdf")
+print(result.text_content)
+```
+
+## Integration with Pi
+
+Use `markitdown` to convert files before reading them with the `read` tool:
+
+```bash
+# Convert then read
+markitdown report.pdf -o /tmp/report.md && read /tmp/report.md
+```
+
+This is especially useful for:
+- PDFs that need structure preserved (headings, lists, tables)
+- Office documents (Word, Excel, PowerPoint)
+- Images with EXIF metadata
+- Any file format not directly readable by the `read` tool
+
+## Security Notes
+
+- MarkItDown performs I/O with current process privileges
+- Sanitize inputs in untrusted environments
+- Only convert files from trusted sources
+- The `[pdf,docx,pptx,xlsx]` extras are installed; audio transcription and Azure AI are NOT installed
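
The "sanitize inputs" note above could be approached by validating a path before it ever reaches MarkItDown. The following is a minimal sketch and not part of the patch: `safe_input_path`, `convert_checked`, `ALLOWED_SUFFIXES`, and the idea of a single allowed root directory are all hypothetical names and assumptions introduced for illustration. Only `MarkItDown().convert(...)` and `.text_content` come from the Python API section.

```python
from pathlib import Path

# Assumption: only the formats covered by the installed extras are accepted.
ALLOWED_SUFFIXES = {".pdf", ".docx", ".pptx", ".xlsx"}


def safe_input_path(candidate: str, allowed_root: str) -> Path:
    """Return the resolved path if it is an acceptable input, else raise ValueError."""
    root = Path(allowed_root).resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    path = Path(candidate).resolve()

    if not path.is_relative_to(root):  # Python 3.9+
        raise ValueError(f"{candidate!r} is outside {allowed_root!r}")
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file type: {path.suffix!r}")
    if not path.is_file():
        raise ValueError(f"{candidate!r} is not a regular file")
    return path


def convert_checked(candidate: str, allowed_root: str) -> str:
    """Validate the path, then convert it to Markdown text."""
    # Imported lazily so the validation helper works without the venv active.
    from markitdown import MarkItDown

    path = safe_input_path(candidate, allowed_root)
    return MarkItDown().convert(str(path)).text_content
```

A call such as `convert_checked("inbox/report.pdf", "inbox")` would convert the file, while a path that resolves outside `inbox` raises `ValueError` before any conversion starts. This only narrows which files are read; it does not make converting untrusted content safe.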