skill: add markitdown file converter
skills/markitdown/SKILL.md
@@ -0,0 +1,87 @@
---
name: markitdown
description: Convert various file formats to Markdown for use with LLMs and text analysis. Supports PDF, Word, Excel, PowerPoint, images, HTML, CSV, JSON, XML, ZIP, EPubs, and YouTube URLs.
---

# MarkItDown

Convert files to Markdown for LLM consumption and text analysis. A lightweight Python utility by Microsoft.

## Installation

MarkItDown is installed in a Python venv at `/tmp/markitdown-env/`, with a wrapper script at `~/.local/bin/markitdown`.

The wrapper sets `LD_LIBRARY_PATH` so that numpy's C extensions load on NixOS.

If the venv is missing (e.g., after a rebuild), recreate it:
```bash
nix-shell -p python3 python3.pkgs.pip python3.pkgs.virtualenv gcc stdenv.cc.cc.lib --run "
python3 -m venv /tmp/markitdown-env
source /tmp/markitdown-env/bin/activate
pip install 'markitdown[pdf,docx,pptx,xlsx]'
"
```
Then recreate the wrapper at `~/.local/bin/markitdown`.
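
Before converting anything, it can help to check whether the venv and wrapper are still in place. The following is a minimal sketch, not part of MarkItDown: the function name `missing_pieces` is hypothetical, the two paths are the ones given above, and it assumes the usual POSIX venv layout with the interpreter at `bin/python`.

```python
import os
from pathlib import Path

# Paths taken from the Installation section above.
VENV_DIR = Path("/tmp/markitdown-env")
WRAPPER = Path.home() / ".local/bin/markitdown"


def missing_pieces(venv_dir: Path = VENV_DIR, wrapper: Path = WRAPPER) -> list[str]:
    """Return a list describing what needs to be recreated (empty if all is fine)."""
    problems = []
    # Assumption: a POSIX venv keeps its interpreter at <venv>/bin/python.
    if not (venv_dir / "bin" / "python").exists():
        problems.append(f"venv missing or incomplete: {venv_dir}")
    # The wrapper has to be a regular file with the executable bit set.
    if not (wrapper.is_file() and os.access(wrapper, os.X_OK)):
        problems.append(f"wrapper missing or not executable: {wrapper}")
    return problems
```

An empty list means both pieces are present; otherwise each entry names the piece to recreate with the steps above.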

## Supported Formats

| Format | Extension | Dependencies |
|--------|-----------|-------------|
| PDF | `.pdf` | pdfminer-six, pdfplumber |
| Word | `.docx` | lxml, mammoth |
| PowerPoint | `.pptx` | python-pptx |
| Excel | `.xlsx`, `.xls` | openpyxl, pandas, xlrd |
| Images | `.jpg`, `.png`, etc. | EXIF metadata (core); OCR needs separate `markitdown-ocr` plugin |
| HTML | `.html`, `.htm` | beautifulsoup4 (core) |
| CSV | `.csv` | (core) |
| JSON | `.json` | (core) |
| XML | `.xml` | (core) |
| ZIP | `.zip` | (core, iterates contents) |
| EPubs | `.epub` | (core) |
| YouTube | URLs | youtube-transcript-api (core) |
| Text | `.txt`, `.md`, etc. | (core) |

## CLI Usage

```bash
# Convert a file to Markdown (stdout)
markitdown path/to/file.pdf

# Write to file
markitdown path/to/file.pdf -o output.md

# Pipe content
cat file.pdf | markitdown
```

## Python API

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```
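
For more than one file, the same call can be wrapped in a small batch loop. This is a sketch under stated assumptions: `convert_tree` is a hypothetical helper, not part of MarkItDown, and it takes the converter as a parameter so that any callable returning an object with a `text_content` attribute will do.

```python
from pathlib import Path
from typing import Any, Callable


def convert_tree(src_dir, out_dir, convert: Callable[[str], Any]) -> dict:
    """Convert every file under src_dir and mirror the results into out_dir.

    `convert` takes a path string and returns an object with a
    `text_content` attribute. Returns {source path: error message}
    for the files that could not be converted.
    """
    src, out = Path(src_dir), Path(out_dir)
    failures = {}
    # Materialise the file list first so outputs written below are never re-read.
    for path in sorted(p for p in src.rglob("*") if p.is_file()):
        try:
            text = convert(str(path)).text_content
        except Exception as exc:  # one bad file should not stop the batch
            failures[str(path)] = str(exc)
            continue
        # Keep the relative layout and append ".md" to the original name.
        target = out / path.relative_to(src)
        target = target.with_name(target.name + ".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")
    return failures
```

With the library installed, passing `MarkItDown().convert` as the third argument should work, for example `convert_tree("docs", "/tmp/docs-md", MarkItDown().convert)`. Appending `.md` rather than replacing the suffix avoids collisions between files such as `report.pdf` and `report.docx`.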

## Integration with Pi

Use `markitdown` to convert files before reading them with the `read` tool:

```bash
# Convert, then open /tmp/report.md with the `read` tool
markitdown report.pdf -o /tmp/report.md
```

This is especially useful for:
- PDFs that need structure preserved (headings, lists, tables)
- Office documents (Word, Excel, PowerPoint)
- Images with EXIF metadata
- Any file format not directly readable by the `read` tool
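
The convert-then-read decision can be sketched as a small helper. Everything named here is hypothetical: `conversion_command` and `NEEDS_CONVERSION` are illustrative, and which extensions the `read` tool handles on its own is an assumption. Only the `markitdown <file> -o <output>` invocation comes from the CLI usage above.

```python
from pathlib import Path

# Assumption: these extensions from the Supported Formats table are not
# directly readable by the `read` tool and should be converted first.
NEEDS_CONVERSION = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".xls",
    ".jpg", ".png", ".html", ".htm", ".zip", ".epub",
}


def conversion_command(path, out_dir="/tmp"):
    """Return (argv, output_path) if the file should be converted, else None."""
    src = Path(path)
    if src.suffix.lower() not in NEEDS_CONVERSION:
        return None  # plain text, Markdown, etc.: read it as-is
    out = Path(out_dir) / (src.stem + ".md")
    return ["markitdown", str(src), "-o", str(out)], out
```

When the helper returns a command, run it (for instance with `subprocess.run(argv, check=True)`) and then read the returned output path; when it returns `None`, read the original file.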

## Security Notes

- MarkItDown performs I/O with the privileges of the current process
- Sanitize inputs in untrusted environments
- Only convert files from trusted sources
- The `[pdf,docx,pptx,xlsx]` extras are installed; audio transcription and Azure AI are NOT installed
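
One way to sanitize local file inputs is to resolve the path and refuse anything outside an allowed directory before handing it to `markitdown`. This is a minimal sketch, not a complete defense: `checked_input` and the size cap are hypothetical, and it does not cover URLs such as YouTube links.

```python
from pathlib import Path

MAX_BYTES = 50 * 1024 * 1024  # hypothetical size cap; tune per deployment


def checked_input(path, allowed_root, max_bytes=MAX_BYTES) -> Path:
    """Return the resolved path, or raise ValueError if it should not be converted."""
    root = Path(allowed_root).resolve()
    resolved = Path(path).resolve()  # follows symlinks and collapses ".."
    if root not in resolved.parents:
        raise ValueError(f"{resolved} is outside {root}")
    if not resolved.is_file():
        raise ValueError(f"{resolved} is not a regular file")
    if resolved.stat().st_size > max_bytes:
        raise ValueError(f"{resolved} exceeds {max_bytes} bytes")
    return resolved
```

Pass the returned path, not the original string, to the converter so that the file that was checked is the file that gets read.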