From 161920639851eca17e602c35d3610fd101b960b2 Mon Sep 17 00:00:00 2001
From: Sam Rolfe <samuelrolfe@gmail.com>
Date: Sun, 7 Jun 2026 17:26:02 +1000
Subject: [PATCH] skill: add markitdown file converter

---
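Note for reviewers (not part of the commit): the Security Notes in the SKILL.md below ask callers to sanitize inputs before conversion. What follows is a minimal sketch of one way to do that ahead of the Python API call; it is not something this patch ships. `ALLOWED_SUFFIXES`, `resolve_trusted_input`, and `convert_trusted` are hypothetical names chosen for illustration; only `MarkItDown().convert(...)` and `.text_content` come from the skill itself. The sketch assumes inputs are local files that should live under one known directory.

```python
from pathlib import Path

# Hypothetical allow-list, chosen to mirror the extras this skill installs.
ALLOWED_SUFFIXES = {".pdf", ".docx", ".pptx", ".xlsx"}


def resolve_trusted_input(candidate, allowed_root):
    """Return the resolved path, or raise ValueError if it should be rejected."""
    root = Path(allowed_root).resolve()
    # Resolving first collapses ".." segments and follows symlinks,
    # so the containment check below runs against the real location.
    path = Path(candidate).resolve()
    try:
        path.relative_to(root)
    except ValueError:
        raise ValueError(f"{path} is outside {root}") from None
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file type: {path.suffix!r}")
    return path


def convert_trusted(candidate, allowed_root):
    """Validate the path, then convert it with the API shown in SKILL.md."""
    # Imported lazily so the validation helper works without markitdown installed.
    from markitdown import MarkItDown

    path = resolve_trusted_input(candidate, allowed_root)
    return MarkItDown().convert(str(path)).text_content
```

A caller might use `convert_trusted("/home/user/docs/report.pdf", "/home/user/docs")`; anything that resolves outside the given directory, or that has an unexpected suffix, raises `ValueError` before MarkItDown touches it.
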
 skills/markitdown/SKILL.md | 87 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)
 create mode 100644 skills/markitdown/SKILL.md

diff --git a/skills/markitdown/SKILL.md b/skills/markitdown/SKILL.md
new file mode 100644
index 0000000..91304ea
--- /dev/null
+++ b/skills/markitdown/SKILL.md
@@ -0,0 +1,87 @@
+---
+name: markitdown
+description: Convert various file formats to Markdown for use with LLMs and text analysis. Supports PDF, Word, Excel, PowerPoint, images, HTML, CSV, JSON, XML, ZIP, EPubs, and YouTube URLs.
+---
+
+# MarkItDown
+
+MarkItDown is a lightweight Python utility from Microsoft that converts files to Markdown for LLM consumption and text analysis.
+
+## Installation
+
+MarkItDown is installed in a Python venv at `/tmp/markitdown-env/`, with a wrapper script at `~/.local/bin/markitdown`.
+
+The wrapper sets `LD_LIBRARY_PATH` so that NumPy's C extensions can load their shared libraries on NixOS.
+
+If the venv is missing (e.g., after a rebuild), recreate it:
+```bash
+nix-shell -p python3 python3.pkgs.pip python3.pkgs.virtualenv gcc stdenv.cc.cc.lib --run "
+  python3 -m venv /tmp/markitdown-env
+  source /tmp/markitdown-env/bin/activate
+  pip install 'markitdown[pdf,docx,pptx,xlsx]'
+"
+```
+Then recreate the wrapper at `~/.local/bin/markitdown`.
+
+## Supported Formats
+
+| Format | Extension | Dependencies |
+|--------|-----------|-------------|
+| PDF | `.pdf` | pdfminer.six, pdfplumber |
+| Word | `.docx` | lxml, mammoth |
+| PowerPoint | `.pptx` | python-pptx |
+| Excel | `.xlsx`, `.xls` | openpyxl, pandas (`xlsx` extra); xlrd (`xls` extra, not part of the install command above) |
+| Images | `.jpg`, `.png`, etc. | EXIF metadata (core); OCR needs separate `markitdown-ocr` plugin |
+| HTML | `.html`, `.htm` | beautifulsoup4 (core) |
+| CSV | `.csv` | (core) |
+| JSON | `.json` | (core) |
+| XML | `.xml` | (core) |
+| ZIP | `.zip` | (core, iterates contents) |
+| EPubs | `.epub` | (core) |
+| YouTube | URLs | youtube-transcript-api (`youtube-transcription` extra, not part of the install command above) |
+| Text | `.txt`, `.md`, etc. | (core) |
+
+## CLI Usage
+
+```bash
+# Convert a file to Markdown (stdout)
+markitdown path/to/file.pdf
+
+# Write to file
+markitdown path/to/file.pdf -o output.md
+
+# Pipe content
+cat file.pdf | markitdown
+```
+
+## Python API
+
+```python
+from markitdown import MarkItDown
+
+md = MarkItDown()
+result = md.convert("document.pdf")
+print(result.text_content)
+```
+
+## Integration with Pi
+
+Use `markitdown` to convert files before reading them with the `read` tool:
+
+```bash
+# Convert to a temporary Markdown file, then open /tmp/report.md with the `read` tool
+markitdown report.pdf -o /tmp/report.md
+```
+
+This is especially useful for:
+- PDFs that need structure preserved (headings, lists, tables)
+- Office documents (Word, Excel, PowerPoint)
+- Images with EXIF metadata
+- Any file format not directly readable by the `read` tool
+
+## Security Notes
+
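+- MarkItDown performs I/O with the privileges of the current process, so it can access anything that process can access
+- Sanitize inputs (file paths and URLs) before conversion in untrusted environments
+- Only convert files from trusted sources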
+- Only the `[pdf,docx,pptx,xlsx]` extras are installed; audio transcription and Azure AI Document Intelligence support are NOT installed