pi-config/extensions/pi-graphify/skills/graphify/SKILL.md
2026-05-10 15:29:02 +10:00
---
name: graphify
description: "Full-pipeline knowledge graph orchestration for graphify. Use when the user asks to build a graph from scratch, run a deep extraction, generate specific export formats (Obsidian, SVG, GraphML, Neo4j), transcribe video, or any operation that requires multi-step orchestration beyond the extension's tools. The extension tools (graphify_build, graphify_query, etc.) handle simple operations; this skill handles the full build pipeline."
---
# /graphify
Turn any folder of files into a navigable knowledge graph with community detection, an honest audit trail, and multiple outputs: interactive HTML, GraphRAG-ready JSON, Obsidian vault, and a plain-language GRAPH_REPORT.md.
## Extension Tools vs. This Skill
The pi-graphify extension registers 8 tools the LLM can call autonomously:
| Tool | When to use |
|------|-------------|
| `graphify_build` | Quick build with standard settings |
| `graphify_query` | BFS/DFS traversal questions |
| `graphify_path` | Shortest path between concepts |
| `graphify_explain` | Plain-language explanation of a node |
| `graphify_add` | Fetch a URL and add to corpus |
| `graphify_update` | Incremental re-extraction |
| `graphify_cluster` | Re-run community detection only |
| `graphify_watch` | Watch directory for changes |
**Use this skill instead of the tools when:**
- The user runs `/graphify` explicitly (they want the full pipeline)
- The user wants export formats the tools don't produce (Obsidian vault, SVG, GraphML, Neo4j cypher)
- The corpus contains video/audio that needs transcription
- The user wants the guided exploration flow after building
- The user wants to run semantic extraction with subagent dispatch (the tools only call the Python CLI, not multi-step orchestration)
For simple operations (query, explain, path, add, update, cluster), prefer the extension tools — they're faster and the LLM can call them autonomously.
## Usage
### `/graphify` command surface (implemented in this extension)
```
/graphify # full pipeline on current directory
/graphify <path> # full pipeline on specific path
/graphify <path> --mode deep # thorough extraction, richer INFERRED edges
/graphify <path> --update # incremental - re-extract only new/changed files
/graphify <path> --cluster-only # rerun clustering on existing graph
/graphify <path> --no-viz # skip visualization, just report + JSON
/graphify <path> --obsidian # generate Obsidian vault
/graphify <path> --svg # also export graph.svg (embeds in Notion, GitHub)
/graphify <path> --graphml # export graph.graphml (Gephi, yEd)
/graphify <path> --neo4j # generate graphify-out/cypher.txt for Neo4j
/graphify add <url> # fetch URL, save to ./raw, update graph
/graphify add <url> --author "Name" # tag who wrote it
/graphify add <url> --contributor "Name" # tag who added it to the corpus
/graphify query "<question>" # BFS traversal - broad context
/graphify query "<question>" --dfs # DFS - trace a specific path
/graphify query "<question>" --budget 1500 # cap answer at N tokens
/graphify path "AuthModule" "Database" # shortest path between two concepts
/graphify explain "SwinTransformer" # plain-language explanation of a node
/graphify update <path> # incremental update (subcommand form)
/graphify watch <path> # watch folder, auto-rebuild on changes
/graphify cluster # rerun clustering on existing graph
/graphify hook <install|uninstall|status> # manage git hooks
```
### Advanced orchestration options (skill-level, not `/graphify` flags)
Use these only when driving the full pipeline from this skill (for example via `/skill:graphify ...`) or when the user asks explicitly:
- `--neo4j-push <uri>`: push graph directly to Neo4j
- `--mcp`: start MCP stdio server for live graph queries
- `--whisper-model <name>`: override transcription model
## What graphify is for
graphify is built around Andrej Karpathy's /raw folder workflow: drop anything into a folder - papers, tweets, screenshots, code, notes - and get a structured knowledge graph that shows you what you didn't know was connected.
Three things it does that your AI assistant alone cannot:
1. **Persistent graph** - relationships are stored in `graphify-out/graph.json` and survive across sessions. Ask questions weeks later without re-reading everything.
2. **Honest audit trail** - every edge is tagged EXTRACTED, INFERRED, or AMBIGUOUS. You know what was found vs invented.
3. **Cross-document surprise** - community detection finds connections between concepts in different files that you would never think to ask about directly.
## What You Must Do When Invoked
If no path was given, use `.` (current directory). Do not ask the user for a path.
### Quick operations → use extension tools
For `query`, `path`, `explain`, `add`, `update`, `cluster`, and `watch` operations, call the corresponding extension tool directly:
| Operation | Tool to call |
|-----------|-------------|
| `query "<question>"` | `graphify_query` with `question` param |
| `query "<question>" --dfs` | `graphify_query` with `question` and `mode: "dfs"` |
| `path "A" "B"` | `graphify_path` with `from` and `to` |
| `explain "Node"` | `graphify_explain` with `concept` |
| `add <url>` | `graphify_add` with `url` |
| `update <path>` or build flag `--update` | `graphify_update` with `path` |
| `cluster` or build flag `--cluster-only` | `graphify_cluster` |
| `watch <path>` | `graphify_watch` with `path` |
### Full pipeline → follow these steps
Follow these steps in order. Do not skip steps.
#### Step 1 - Ensure graphify is installed
```bash
# Detect the correct Python interpreter (handles pipx, venv, system installs)
GRAPHIFY_BIN=$(which graphify 2>/dev/null)
if [ -n "$GRAPHIFY_BIN" ]; then
  # Strip the "#!" prefix; fall back if the result is not a plain interpreter
  # path (an "/usr/bin/env python3" shebang contains a space, so it falls back)
  PYTHON=$(head -1 "$GRAPHIFY_BIN" | sed 's/^#!//')
  case "$PYTHON" in
    *[!a-zA-Z0-9/_.-]*) PYTHON="python3" ;;
  esac
else
  PYTHON="python3"
fi
"$PYTHON" -c "import graphify" 2>/dev/null || "$PYTHON" -m pip install graphify -q 2>/dev/null || "$PYTHON" -m pip install graphify -q --break-system-packages 2>&1 | tail -3
mkdir -p graphify-out
# Write interpreter path for all subsequent steps
"$PYTHON" -c "import sys; open('.graphify_python', 'w').write(sys.executable)"
```
If the import succeeds, print nothing and move straight to Step 2.
**In every subsequent bash block, replace `python3` with `$(cat .graphify_python)` to use the correct interpreter.**
#### Step 2 - Detect files
```bash
$(cat .graphify_python) -c "
import json
from graphify.detect import detect
from pathlib import Path
result = detect(Path('INPUT_PATH'))
print(json.dumps(result))
" > .graphify_detect.json
```
Replace INPUT_PATH with the actual path the user provided. Do NOT cat or print the JSON - read it silently and present a clean summary instead:
```
Corpus: X files · ~Y words
code: N files (.py .ts .go ...)
docs: N files (.md .txt ...)
papers: N files (.pdf ...)
images: N files
video: N files (.mp4 .mp3 ...)
```
Omit any category with 0 files from the summary.
Then act on it:
- If `total_files` is 0: stop with "No supported files found in [path]."
- If `skipped_sensitive` is non-empty: mention how many files were skipped, not their names.
- If `total_words` > 2,000,000 OR `total_files` > 200: show the warning and the top 5 subdirectories by file count, then ask which subfolder to run on. Wait for the user's answer before proceeding.
- Otherwise: proceed directly to Step 2.5 if video files were detected, or Step 3 if not.
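The summary format above can be produced mechanically from the detect JSON. A minimal sketch (the `summarize` helper is hypothetical; the keys `total_files`, `total_words`, and the per-category `files` dict are assumed from the checks above):

```python
from pathlib import Path

def summarize(detect: dict) -> str:
    """Render the corpus summary from detect output, omitting empty categories."""
    lines = [f"Corpus: {detect['total_files']} files · ~{detect['total_words']} words"]
    for category, files in detect.get("files", {}).items():
        if not files:
            continue  # omit categories with 0 files
        exts = sorted({Path(f).suffix for f in files if Path(f).suffix})
        lines.append(f"  {category}: {len(files)} files ({' '.join(exts)})")
    return "\n".join(lines)

# In the pipeline, feed it the real detect JSON:
# import json; print(summarize(json.loads(Path('.graphify_detect.json').read_text())))
```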
#### Step 2.5 - Transcribe video / audio files (only if video files detected)
Skip this step entirely if `detect` returned zero `video` files.
Video and audio files cannot be read directly. Transcribe them to text first, then treat the transcripts as doc files in Step 3.
**Strategy:** Read the god nodes from the detect output or analysis file. You are already a language model - write a one-sentence domain hint yourself from those labels. Then pass it to Whisper as the initial prompt. No separate API call needed.
**However**, if the corpus has *only* video files and no other docs/code, use the generic fallback prompt: `"Use proper punctuation and paragraph breaks."`
**Step 1 - Write the Whisper prompt yourself.**
Read the top god node labels from detect output or analysis, then compose a short domain hint sentence, for example:
- Labels: `transformer, attention, encoder, decoder` -> `"Machine learning research on transformer architectures and attention mechanisms. Use proper punctuation and paragraph breaks."`
- Labels: `kubernetes, deployment, pod, helm` -> `"DevOps discussion about Kubernetes deployments and Helm charts. Use proper punctuation and paragraph breaks."`
Set it as `GRAPHIFY_WHISPER_PROMPT` in the environment before running the transcription command.
**Step 2 - Transcribe:**
```bash
$(cat .graphify_python) -c "
import json, os
from pathlib import Path
from graphify.transcribe import transcribe_all
detect = json.loads(Path('.graphify_detect.json').read_text())
video_files = detect.get('files', {}).get('video', [])
prompt = os.environ.get('GRAPHIFY_WHISPER_PROMPT', 'Use proper punctuation and paragraph breaks.')
transcript_paths = transcribe_all(video_files, initial_prompt=prompt)
print(json.dumps(transcript_paths))
" > graphify-out/.graphify_transcripts.json
```
After transcription:
- Read the transcript paths from `graphify-out/.graphify_transcripts.json`
- Add them to the docs list before dispatching semantic subagents in Step 3B
- Print how many transcripts were created: `Transcribed N video file(s) -> treating as docs`
- If transcription fails for a file, print a warning and continue with the rest
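The bullet points above amount to a small bookkeeping helper. A sketch, assuming `transcribe_all` wrote a JSON list of transcript paths with `null` entries for failures (the `load_transcripts` helper is hypothetical):

```python
import json
from pathlib import Path

def load_transcripts(path="graphify-out/.graphify_transcripts.json"):
    """Read transcript paths from the transcription step; tolerate a missing
    file and drop null entries (failed transcriptions) rather than halting."""
    p = Path(path)
    if not p.exists():
        return []
    return [t for t in json.loads(p.read_text()) if t]

# transcripts = load_transcripts()
# docs_for_step_3b = detect_docs + transcripts  # add before Step 3B extraction
# print(f"Transcribed {len(transcripts)} video file(s) -> treating as docs")
```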
**Whisper model:** Default is `base`. If the user passed `--whisper-model <name>`, set `GRAPHIFY_WHISPER_MODEL=<name>` in the environment before running the command above.
#### Step 3 - Extract entities and relationships
**Before starting:** note whether `--mode deep` was given. You must pass `DEEP_MODE=true` to every subagent in Step B2 if it was. Track this from the original invocation - do not lose it.
This step has two parts: **structural extraction** (deterministic, free) and **semantic extraction** (your AI model, costs tokens).
**Run Part A (AST) and Part B (semantic) in parallel. Dispatch all semantic subagents AND start AST extraction in the same message. Both can run simultaneously since they operate on different file types. Merge results in Part C as before.**
Note: Parallelizing AST + semantic saves 5-15s on large corpora. AST is deterministic and fast; start it while subagents are processing docs/papers.
##### Part A - Structural extraction for code files
For any code files detected, run AST extraction in parallel with Part B subagents:
```bash
$(cat .graphify_python) -c "
import json
from graphify.extract import collect_files, extract
from pathlib import Path
code_files = []
detect = json.loads(Path('.graphify_detect.json').read_text())
for f in detect.get('files', {}).get('code', []):
code_files.extend(collect_files(Path(f)) if Path(f).is_dir() else [Path(f)])
if code_files:
result = extract(code_files)
Path('.graphify_ast.json').write_text(json.dumps(result, indent=2))
print(f'AST: {len(result[\"nodes\"])} nodes, {len(result[\"edges\"])} edges')
else:
Path('.graphify_ast.json').write_text(json.dumps({'nodes':[],'edges':[],'input_tokens':0,'output_tokens':0}))
print('No code files - skipping AST extraction')
"
```
##### Part B - Semantic extraction (parallel subagents)
**Fast path:** If detection found zero docs, papers, and images (code-only corpus), skip Part B entirely and go straight to Part C. AST handles code - there is nothing for semantic subagents to do.
> **OpenClaw platform:** Multi-agent support is still early on OpenClaw. Extraction runs sequentially — you read and extract each file yourself. This is slower than parallel platforms but fully reliable.
Print: `"Semantic extraction: N files (sequential — OpenClaw)"`
**Step B0 - Check extraction cache first**
Before dispatching any subagents, check which files already have cached extraction results:
```bash
$(cat .graphify_python) -c "
import json
from graphify.cache import check_semantic_cache
from pathlib import Path
detect = json.loads(Path('.graphify_detect.json').read_text())
all_files = [f for files in detect['files'].values() for f in files]
cached_nodes, cached_edges, cached_hyperedges, uncached = check_semantic_cache(all_files)
if cached_nodes or cached_edges or cached_hyperedges:
Path('.graphify_cached.json').write_text(json.dumps({'nodes': cached_nodes, 'edges': cached_edges, 'hyperedges': cached_hyperedges}))
Path('.graphify_uncached.txt').write_text('\n'.join(uncached))
print(f'Cache: {len(all_files)-len(uncached)} files hit, {len(uncached)} files need extraction')
"
```
Only dispatch subagents for files listed in `.graphify_uncached.txt`. If all files are cached, skip to Part C directly.
**Step B1 - Split into chunks**
Load files from `.graphify_uncached.txt`. Split into chunks of 20-25 files each. Each image gets its own chunk (vision needs separate context). When splitting, group files from the same directory together so related artifacts land in the same chunk and cross-file relationships are more likely to be extracted.
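The splitting rule can be sketched as follows (a minimal illustration, not part of the shipped library; the chunk size comes from the 20-25 guideline above):

```python
from collections import defaultdict
from pathlib import Path

def make_chunks(files, images, chunk_size=25):
    """Group non-image files by parent directory into chunks of at most
    chunk_size, so related files land together; each image gets its own
    chunk because vision needs separate context."""
    by_dir = defaultdict(list)
    for f in files:
        by_dir[str(Path(f).parent)].append(f)
    chunks, current = [], []
    for d in sorted(by_dir):          # walk directory groups in order
        for f in by_dir[d]:
            current.append(f)
            if len(current) >= chunk_size:
                chunks.append(current)
                current = []
    if current:
        chunks.append(current)
    chunks.extend([img] for img in images)  # one chunk per image
    return chunks
```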
**Step B2 - Sequential extraction (OpenClaw)**
Process each file one at a time. For each file:
1. Read the file contents
2. Extract nodes, edges, and hyperedges applying the same rules:
- EXTRACTED: relationship explicit in source (import, call, citation)
- INFERRED: reasonable inference (shared structure, implied dependency)
- AMBIGUOUS: uncertain — flag it, do not omit
- Code files: semantic edges AST cannot find. Do not re-extract imports.
- Doc/paper files: named concepts, entities, citations. Store rationale (WHY decisions were made) as a `rationale` attribute on the relevant node, not as a separate node. Use `file_type:"rationale"` for concept-like nodes (ideas, principles, mechanisms). Do NOT invent file_types like `concept`. When adding `calls` edges: source is caller, target is callee.
- Image files: use vision — understand what the image IS, not just OCR
- DEEP_MODE (if --mode deep): be aggressive with INFERRED edges
- Semantic similarity: if two concepts solve the same problem without a structural link, add `semantically_similar_to` INFERRED edge (confidence 0.6-0.95). Non-obvious cross-file links only.
- Hyperedges: if 3+ nodes share a concept/flow not captured by pairwise edges, add a hyperedge. Max 3 per file.
- confidence_score REQUIRED on every edge: EXTRACTED=1.0, INFERRED=0.6-0.9 (reason individually), AMBIGUOUS=0.1-0.3
3. Accumulate results across all files
Schema for each file's output:
```json
{
  "nodes": [{
    "id": "filestem_entityname",
    "label": "Human Readable Name",
    "file_type": "code|document|paper|image|rationale",
    "source_file": "relative/path",
    "source_location": null,
    "source_url": null,
    "captured_at": null,
    "author": null,
    "contributor": null
  }],
  "edges": [{
    "source": "node_id",
    "target": "node_id",
    "relation": "calls|implements|references|cites|conceptually_related_to|shares_data_with|semantically_similar_to|rationale_for",
    "confidence": "EXTRACTED|INFERRED|AMBIGUOUS",
    "confidence_score": 1.0,
    "source_file": "relative/path",
    "source_location": null,
    "weight": 1.0
  }],
  "hyperedges": [{
    "id": "snake_case_id",
    "label": "Human Readable Label",
    "nodes": ["node_id1", "node_id2", "node_id3"],
    "relation": "participate_in|implement|form",
    "confidence": "EXTRACTED|INFERRED",
    "confidence_score": 0.75,
    "source_file": "relative/path"
  }],
  "input_tokens": 0,
  "output_tokens": 0
}
```
After processing all files, write the accumulated result to `.graphify_semantic_new.json`.
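Before writing the accumulated result, a quick validation pass catches schema violations early (a sketch using the field names from the schema above; `validate` is a hypothetical helper, not a graphify API):

```python
ALLOWED_CONF = {"EXTRACTED", "INFERRED", "AMBIGUOUS"}

def validate(result: dict) -> list:
    """Return a list of problems in one file's extraction output."""
    problems = []
    ids = {n["id"] for n in result.get("nodes", [])}
    for e in result.get("edges", []):
        if e.get("confidence") not in ALLOWED_CONF:
            problems.append(f"bad confidence on {e.get('source')}->{e.get('target')}")
        if not isinstance(e.get("confidence_score"), (int, float)):
            problems.append(f"missing confidence_score on {e.get('source')}->{e.get('target')}")
        for end in ("source", "target"):
            if e.get(end) not in ids:
                problems.append(f"edge references unknown node {e.get(end)!r}")
    for h in result.get("hyperedges", []):
        if len(h.get("nodes", [])) < 3:
            problems.append(f"hyperedge {h.get('id')} has fewer than 3 nodes")
    return problems
```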
**Step B3 - Cache and merge**
If sequential extraction in Step B2 already wrote `.graphify_semantic_new.json`, skip the chunk merge below and go straight to "Save new results to cache". If per-chunk files (`.graphify_chunk_*.json`) were written instead, first check the failure rate: if more than half the chunks failed, stop and tell the user. **On platforms with subagent dispatch, read the real token counts from each Agent tool result's `usage` field and write them back into the chunk JSON before merging** (the chunk JSON itself always has placeholder zeros). Then merge the chunks into `.graphify_semantic_new.json`:
```bash
$(cat .graphify_python) -c "
import json, glob
from pathlib import Path
chunks = sorted(glob.glob('.graphify_chunk_*.json'))
all_nodes, all_edges, all_hyperedges = [], [], []
total_in, total_out = 0, 0
for c in chunks:
d = json.loads(Path(c).read_text())
all_nodes += d.get('nodes', [])
all_edges += d.get('edges', [])
all_hyperedges += d.get('hyperedges', [])
total_in += d.get('input_tokens', 0)
total_out += d.get('output_tokens', 0)
Path('.graphify_semantic_new.json').write_text(json.dumps({
'nodes': all_nodes, 'edges': all_edges, 'hyperedges': all_hyperedges,
'input_tokens': total_in, 'output_tokens': total_out,
}, indent=2))
print(f'Merged {len(chunks)} chunks: {total_in:,} in / {total_out:,} out tokens')
"
```
Save new results to cache:
```bash
$(cat .graphify_python) -c "
import json
from graphify.cache import save_semantic_cache
from pathlib import Path
new = json.loads(Path('.graphify_semantic_new.json').read_text()) if Path('.graphify_semantic_new.json').exists() else {'nodes':[],'edges':[],'hyperedges':[]}
saved = save_semantic_cache(new.get('nodes', []), new.get('edges', []), new.get('hyperedges', []))
print(f'Cached {saved} files')
"
```
Merge cached + new results into `.graphify_semantic.json`:
```bash
$(cat .graphify_python) -c "
import json
from pathlib import Path
cached = json.loads(Path('.graphify_cached.json').read_text()) if Path('.graphify_cached.json').exists() else {'nodes':[],'edges':[],'hyperedges':[]}
new = json.loads(Path('.graphify_semantic_new.json').read_text()) if Path('.graphify_semantic_new.json').exists() else {'nodes':[],'edges':[],'hyperedges':[]}
all_nodes = cached['nodes'] + new.get('nodes', [])
all_edges = cached['edges'] + new.get('edges', [])
all_hyperedges = cached.get('hyperedges', []) + new.get('hyperedges', [])
seen = set()
deduped = []
for n in all_nodes:
if n['id'] not in seen:
seen.add(n['id'])
deduped.append(n)
merged = {
'nodes': deduped,
'edges': all_edges,
'hyperedges': all_hyperedges,
'input_tokens': new.get('input_tokens', 0),
'output_tokens': new.get('output_tokens', 0),
}
Path('.graphify_semantic.json').write_text(json.dumps(merged, indent=2))
print(f'Extraction complete - {len(deduped)} nodes, {len(all_edges)} edges ({len(cached[\"nodes\"])} from cache, {len(new.get(\"nodes\",[]))} new)')
"
```
Clean up temp files: `rm -f .graphify_cached.json .graphify_uncached.txt .graphify_semantic_new.json`
##### Part C - Merge AST + semantic into final extraction
```bash
$(cat .graphify_python) -c "
import sys, json
from pathlib import Path
ast = json.loads(Path('.graphify_ast.json').read_text())
sem = json.loads(Path('.graphify_semantic.json').read_text())
# Merge: AST nodes first, semantic nodes deduplicated by id
seen = {n['id'] for n in ast['nodes']}
merged_nodes = list(ast['nodes'])
for n in sem['nodes']:
if n['id'] not in seen:
merged_nodes.append(n)
seen.add(n['id'])
merged_edges = ast['edges'] + sem['edges']
merged_hyperedges = sem.get('hyperedges', [])
merged = {
'nodes': merged_nodes,
'edges': merged_edges,
'hyperedges': merged_hyperedges,
'input_tokens': sem.get('input_tokens', 0),
'output_tokens': sem.get('output_tokens', 0),
}
Path('.graphify_extract.json').write_text(json.dumps(merged, indent=2))
total = len(merged_nodes)
edges = len(merged_edges)
print(f'Merged: {total} nodes, {edges} edges ({len(ast[\"nodes\"])} AST + {len(sem[\"nodes\"])} semantic)')
"
```
#### Step 4 - Build graph, cluster, analyze, generate outputs
```bash
mkdir -p graphify-out
$(cat .graphify_python) -c "
import sys, json
from graphify.build import build_from_json
from graphify.cluster import cluster, score_all
from graphify.analyze import god_nodes, surprising_connections, suggest_questions
from graphify.report import generate
from graphify.export import to_json
from pathlib import Path
extraction = json.loads(Path('.graphify_extract.json').read_text())
detection = json.loads(Path('.graphify_detect.json').read_text())
G = build_from_json(extraction)
if G.number_of_nodes() == 0:
    print('ERROR: Graph is empty - extraction produced no nodes.')
    print('Possible causes: all files were skipped, binary-only corpus, or extraction failed.')
    raise SystemExit(1)
communities = cluster(G)
cohesion = score_all(G, communities)
tokens = {'input': extraction.get('input_tokens', 0), 'output': extraction.get('output_tokens', 0)}
gods = god_nodes(G)
surprises = surprising_connections(G, communities)
labels = {cid: 'Community ' + str(cid) for cid in communities}
# Placeholder questions - regenerated with real labels in Step 5
questions = suggest_questions(G, communities, labels)
report = generate(G, communities, cohesion, labels, gods, surprises, detection, tokens, 'INPUT_PATH', suggested_questions=questions)
Path('graphify-out/GRAPH_REPORT.md').write_text(report)
to_json(G, communities, 'graphify-out/graph.json')
analysis = {
    'communities': {str(k): v for k, v in communities.items()},
    'cohesion': {str(k): v for k, v in cohesion.items()},
    'gods': gods,
    'surprises': surprises,
    'questions': questions,
}
Path('.graphify_analysis.json').write_text(json.dumps(analysis, indent=2))
print(f'Graph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges, {len(communities)} communities')
"
```
If this step prints `ERROR: Graph is empty`, stop and tell the user what happened - do not proceed to labeling or visualization.
Replace INPUT_PATH with the actual path.
#### Step 5 - Label communities
Read `.graphify_analysis.json`. For each community key, look at its node labels and write a 2-5 word plain-language name (e.g. "Attention Mechanism", "Training Pipeline", "Data Loading").
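To see what each community contains before naming it, a quick preview helps (a sketch; `preview_communities` is a hypothetical helper, and it assumes `analysis['communities']` maps community id to a list of node ids, as the Step 4 block writes):

```python
import json
from pathlib import Path

def preview_communities(analysis: dict, sample: int = 8) -> None:
    """Print a few member node ids per community to help pick 2-5 word names."""
    items = sorted(analysis["communities"].items(), key=lambda kv: int(kv[0]))
    for cid, members in items:
        shown = ", ".join(members[:sample])
        print(f"Community {cid} ({len(members)} nodes): {shown}")

# In the pipeline:
# preview_communities(json.loads(Path('.graphify_analysis.json').read_text()))
```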
Then regenerate the report and save the labels for the visualizer:
```bash
$(cat .graphify_python) -c "
import sys, json
from graphify.build import build_from_json
from graphify.cluster import score_all
from graphify.analyze import god_nodes, surprising_connections, suggest_questions
from graphify.report import generate
from pathlib import Path
extraction = json.loads(Path('.graphify_extract.json').read_text())
detection = json.loads(Path('.graphify_detect.json').read_text())
analysis = json.loads(Path('.graphify_analysis.json').read_text())
G = build_from_json(extraction)
communities = {int(k): v for k, v in analysis['communities'].items()}
cohesion = {int(k): v for k, v in analysis['cohesion'].items()}
tokens = {'input': extraction.get('input_tokens', 0), 'output': extraction.get('output_tokens', 0)}
# LABELS - replace these with the names you chose above
labels = LABELS_DICT
# Regenerate questions with real community labels (labels affect question phrasing)
questions = suggest_questions(G, communities, labels)
report = generate(G, communities, cohesion, labels, analysis['gods'], analysis['surprises'], detection, tokens, 'INPUT_PATH', suggested_questions=questions)
Path('graphify-out/GRAPH_REPORT.md').write_text(report)
Path('.graphify_labels.json').write_text(json.dumps({str(k): v for k, v in labels.items()}))
print('Report updated with community labels')
"
```
Replace `LABELS_DICT` with the actual dict you constructed (e.g. `{0: "Attention Mechanism", 1: "Training Pipeline"}`).
Replace INPUT_PATH with the actual path.
#### Step 6 - Generate Obsidian vault (opt-in) + HTML
**Generate HTML always** (unless `--no-viz`). **Obsidian vault only if `--obsidian` was explicitly given** — skip it otherwise, it generates one file per node.
If `--obsidian` was given:
```bash
$(cat .graphify_python) -c "
import sys, json
from graphify.build import build_from_json
from graphify.export import to_obsidian, to_canvas
from pathlib import Path
extraction = json.loads(Path('.graphify_extract.json').read_text())
analysis = json.loads(Path('.graphify_analysis.json').read_text())
labels_raw = json.loads(Path('.graphify_labels.json').read_text()) if Path('.graphify_labels.json').exists() else {}
G = build_from_json(extraction)
communities = {int(k): v for k, v in analysis['communities'].items()}
cohesion = {int(k): v for k, v in analysis['cohesion'].items()}
labels = {int(k): v for k, v in labels_raw.items()}
n = to_obsidian(G, communities, 'graphify-out/obsidian', community_labels=labels or None, cohesion=cohesion)
print(f'Obsidian vault: {n} notes in graphify-out/obsidian/')
to_canvas(G, communities, 'graphify-out/obsidian/graph.canvas', community_labels=labels or None)
print('Canvas: graphify-out/obsidian/graph.canvas - open in Obsidian for structured community layout')
print()
print('Open graphify-out/obsidian/ as a vault in Obsidian.')
print(' Graph view - nodes colored by community (set automatically)')
print(' graph.canvas - structured layout with communities as groups')
print(' _COMMUNITY_* - overview notes with cohesion scores and dataview queries')
"
```
Generate the HTML graph (always, unless `--no-viz`):
```bash
$(cat .graphify_python) -c "
import sys, json
from graphify.build import build_from_json
from graphify.export import to_html
from pathlib import Path
extraction = json.loads(Path('.graphify_extract.json').read_text())
analysis = json.loads(Path('.graphify_analysis.json').read_text())
labels_raw = json.loads(Path('.graphify_labels.json').read_text()) if Path('.graphify_labels.json').exists() else {}
G = build_from_json(extraction)
communities = {int(k): v for k, v in analysis['communities'].items()}
labels = {int(k): v for k, v in labels_raw.items()}
if G.number_of_nodes() > 5000:
print(f'Graph has {G.number_of_nodes()} nodes - too large for HTML viz. Use Obsidian vault instead.')
else:
to_html(G, communities, 'graphify-out/graph.html', community_labels=labels or None)
print('graph.html written - open in any browser, no server needed')
"
```
#### Step 7 - Export formats (optional, only if flagged)
##### Neo4j export (only if --neo4j or --neo4j-push flag)
**If `--neo4j`** - generate a Cypher file for manual import:
```bash
$(cat .graphify_python) -c "
import sys, json
from graphify.build import build_from_json
from graphify.export import to_cypher
from pathlib import Path
G = build_from_json(json.loads(Path('.graphify_extract.json').read_text()))
to_cypher(G, 'graphify-out/cypher.txt')
print('cypher.txt written - import with: cypher-shell < graphify-out/cypher.txt')
"
```
**If `--neo4j-push <uri>`** - push directly to a running Neo4j instance. Ask the user for credentials if not provided:
```bash
$(cat .graphify_python) -c "
import sys, json
from graphify.build import build_from_json
from graphify.cluster import cluster
from graphify.export import push_to_neo4j
from pathlib import Path
extraction = json.loads(Path('.graphify_extract.json').read_text())
analysis = json.loads(Path('.graphify_analysis.json').read_text())
G = build_from_json(extraction)
communities = {int(k): v for k, v in analysis['communities'].items()}
result = push_to_neo4j(G, uri='NEO4J_URI', user='NEO4J_USER', password='NEO4J_PASSWORD', communities=communities)
print(f'Pushed to Neo4j: {result[\"nodes\"]} nodes, {result[\"edges\"]} edges')
"
```
Replace `NEO4J_URI`, `NEO4J_USER`, `NEO4J_PASSWORD` with actual values. Default URI is `bolt://localhost:7687`, default user is `neo4j`. Uses MERGE - safe to re-run without creating duplicates.
##### SVG export (only if --svg flag)
```bash
$(cat .graphify_python) -c "
import sys, json
from graphify.build import build_from_json
from graphify.export import to_svg
from pathlib import Path
extraction = json.loads(Path('.graphify_extract.json').read_text())
analysis = json.loads(Path('.graphify_analysis.json').read_text())
labels_raw = json.loads(Path('.graphify_labels.json').read_text()) if Path('.graphify_labels.json').exists() else {}
G = build_from_json(extraction)
communities = {int(k): v for k, v in analysis['communities'].items()}
labels = {int(k): v for k, v in labels_raw.items()}
to_svg(G, communities, 'graphify-out/graph.svg', community_labels=labels or None)
print('graph.svg written - embeds in Obsidian, Notion, GitHub READMEs')
"
```
##### GraphML export (only if --graphml flag)
```bash
$(cat .graphify_python) -c "
import sys, json
from graphify.build import build_from_json
from graphify.export import to_graphml
from pathlib import Path
extraction = json.loads(Path('.graphify_extract.json').read_text())
analysis = json.loads(Path('.graphify_analysis.json').read_text())
G = build_from_json(extraction)
communities = {int(k): v for k, v in analysis['communities'].items()}
to_graphml(G, communities, 'graphify-out/graph.graphml')
print('graph.graphml written - open in Gephi, yEd, or any GraphML tool')
"
```
##### MCP server (only if --mcp flag)
```bash
$(cat .graphify_python) -m graphify.serve graphify-out/graph.json
```
This starts a stdio MCP server that exposes tools: `query_graph`, `get_node`, `get_neighbors`, `get_community`, `god_nodes`, `graph_stats`, `shortest_path`. Add to Claude Desktop or any MCP-compatible agent orchestrator so other agents can query the graph live.
To configure in Claude Desktop, add to `claude_desktop_config.json`:
```json
{
"mcpServers": {
"graphify": {
"command": "python3",
"args": ["-m", "graphify.serve", "/absolute/path/to/graphify-out/graph.json"]
}
}
}
```
#### Step 8 - Token reduction benchmark (only if total_words > 5000)
If `total_words` from `.graphify_detect.json` is greater than 5,000, run:
```bash
$(cat .graphify_python) -c "
import json
from graphify.benchmark import run_benchmark, print_benchmark
from pathlib import Path
detection = json.loads(Path('.graphify_detect.json').read_text())
result = run_benchmark('graphify-out/graph.json', corpus_words=detection['total_words'])
print_benchmark(result)
"
```
Print the output directly in chat. If `total_words <= 5000`, skip silently: for small corpora the graph's value is structural clarity, not token compression.
---
#### Step 9 - Save manifest, update cost tracker, clean up, and report
```bash
$(cat .graphify_python) -c "
import json
from pathlib import Path
from datetime import datetime, timezone
from graphify.detect import save_manifest
# Save manifest for --update
detect = json.loads(Path('.graphify_detect.json').read_text())
save_manifest(detect['files'])
# Update cumulative cost tracker
extract = json.loads(Path('.graphify_extract.json').read_text())
input_tok = extract.get('input_tokens', 0)
output_tok = extract.get('output_tokens', 0)
cost_path = Path('graphify-out/cost.json')
if cost_path.exists():
cost = json.loads(cost_path.read_text())
else:
cost = {'runs': [], 'total_input_tokens': 0, 'total_output_tokens': 0}
cost['runs'].append({
'date': datetime.now(timezone.utc).isoformat(),
'input_tokens': input_tok,
'output_tokens': output_tok,
'files': detect.get('total_files', 0),
})
cost['total_input_tokens'] += input_tok
cost['total_output_tokens'] += output_tok
cost_path.write_text(json.dumps(cost, indent=2))
print(f'This run: {input_tok:,} input tokens, {output_tok:,} output tokens')
print(f'All time: {cost[\"total_input_tokens\"]:,} input, {cost[\"total_output_tokens\"]:,} output ({len(cost[\"runs\"])} runs)')
"
rm -f .graphify_detect.json .graphify_extract.json .graphify_ast.json .graphify_semantic.json .graphify_analysis.json .graphify_labels.json .graphify_chunk_*.json
rm -f graphify-out/.needs_update 2>/dev/null || true
```
Tell the user (omit the obsidian line unless --obsidian was given):
```
Graph complete. Outputs in PATH_TO_DIR/graphify-out/
graph.html - interactive graph, open in browser
GRAPH_REPORT.md - audit report
graph.json - raw graph data
obsidian/ - Obsidian vault (only if --obsidian was given)
```
If graphify saved you time, consider supporting it: https://github.com/sponsors/safishamsi
Replace PATH_TO_DIR with the actual absolute path of the directory that was processed.
Then paste these sections from GRAPH_REPORT.md directly into the chat:
- God Nodes
- Surprising Connections
- Suggested Questions
Do NOT paste the full report - just those three sections. Keep it concise.
Then immediately offer to explore. Pick the single most interesting suggested question from the report - the one that crosses the most community boundaries or has the most surprising bridge node - and ask:
> "The most interesting question this graph can answer: **[question]**. Want me to trace it?"
If the user says yes, use the `graphify_query` extension tool to answer the question and walk them through it using the graph structure. Keep going as long as they want to explore. Each answer should end with a natural follow-up ("this connects to X - want to go deeper?") so the session feels like navigation, not a one-shot report.
The graph is the map. Your job after the pipeline is to be the guide.
---
## For --update (incremental re-extraction)
Use when you've added or modified files since the last run. Only re-extracts changed files - saves tokens and time.
**If all changed files are code files:** use the `graphify_update` extension tool — it handles code-only updates without LLM semantic extraction.
**If changed files include docs, papers, or images:** follow the full pipeline Steps 1–9, but replace Step 2's `detect` with `detect_incremental`:
```bash
$(cat .graphify_python) -c "
import json
from graphify.detect import detect_incremental, save_manifest
from pathlib import Path
result = detect_incremental(Path('INPUT_PATH'))
new_total = result.get('new_total', 0)
print(json.dumps(result, indent=2))
Path('.graphify_incremental.json').write_text(json.dumps(result))
if new_total == 0:
print('No files changed since last run. Nothing to update.')
raise SystemExit(0)
print(f'{new_total} new/changed file(s) to re-extract.')
"
```
Then check whether all changed files are code files:
```bash
$(cat .graphify_python) -c "
import json
from pathlib import Path
p = Path('.graphify_incremental.json')
result = json.loads(p.read_text()) if p.exists() else {}
code_exts = {'.py','.ts','.js','.go','.rs','.java','.cpp','.c','.rb','.swift','.kt','.cs','.scala','.php','.cc','.cxx','.hpp','.h','.kts'}
new_files = result.get('new_files', {})
all_changed = [f for files in new_files.values() for f in files]
code_only = all(Path(f).suffix.lower() in code_exts for f in all_changed)
print('code_only:', code_only)
"
```
If `code_only` is True: print `[graphify update] Code-only changes detected - skipping semantic extraction (no LLM needed)`, run only Step 3A (AST) on the changed files, skip Step 3B entirely (no subagents), then go straight to merge and Steps 4–8.
If `code_only` is False (any changed file is a doc/paper/image): run the full Steps 3A–3C pipeline as normal.
Then:
```bash
$(cat .graphify_python) -c "
import json
from pathlib import Path
from networkx.readwrite import json_graph
from graphify.build import build_from_json
# Load existing graph
existing_data = json.loads(Path('graphify-out/graph.json').read_text())
G_existing = json_graph.node_link_graph(existing_data, edges='links')
# Load new extraction and build a graph from it
new_extraction = json.loads(Path('.graphify_extract.json').read_text())
G_new = build_from_json(new_extraction)
# Merge: new nodes/edges into the existing graph (attributes from the new graph win)
G_existing.update(G_new)
# Persist the merged graph so the remaining steps operate on it
merged_data = json_graph.node_link_data(G_existing, edges='links')
Path('graphify-out/graph.json').write_text(json.dumps(merged_data))
print(f'Merged: {G_existing.number_of_nodes()} nodes, {G_existing.number_of_edges()} edges')
"
```
Then run Steps 4–8 on the merged graph as normal.
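NetworkX's `Graph.update` takes the union of nodes and edges, with attributes from the incoming graph winning on conflicts. A minimal stdlib sketch of the same merge over node-link dicts (illustrative only - the real merge goes through NetworkX as shown above):

```python
def merge_node_link(old, new):
    """Union two node-link graph dicts; entries from `new` win on shared ids."""
    nodes = {n["id"]: n for n in old["nodes"]}
    nodes.update({n["id"]: n for n in new["nodes"]})
    links = {(e["source"], e["target"]): e for e in old["links"]}
    links.update({(e["source"], e["target"]): e for e in new["links"]})
    return {"nodes": list(nodes.values()), "links": list(links.values())}

old = {"nodes": [{"id": "a", "weight": 1}, {"id": "b"}],
       "links": [{"source": "a", "target": "b"}]}
new = {"nodes": [{"id": "b", "weight": 2}, {"id": "c"}],
       "links": [{"source": "b", "target": "c"}]}
merged = merge_node_link(old, new)
print(len(merged["nodes"]), len(merged["links"]))  # 3 2
```

Note that node `b` ends up with the attributes from `new` - the same precedence `Graph.update` applies.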
After Step 4, show the graph diff:
```bash
$(cat .graphify_python) -c "
import json
from pathlib import Path
from networkx.readwrite import json_graph
from graphify.analyze import graph_diff
from graphify.build import build_from_json
# Load old graph (before update) from the backup written before the merge
old_path = Path('.graphify_old.json')
old_data = json.loads(old_path.read_text()) if old_path.exists() else None
new_extract = json.loads(Path('.graphify_extract.json').read_text())
G_new = build_from_json(new_extract)
if old_data:
    G_old = json_graph.node_link_graph(old_data, edges='links')
    diff = graph_diff(G_old, G_new)
    print(diff['summary'])
    if diff['new_nodes']:
        print('New nodes:', ', '.join(n['label'] for n in diff['new_nodes'][:5]))
    if diff['new_edges']:
        print('New edges:', len(diff['new_edges']))
else:
    print('No .graphify_old.json backup found - skipping diff.')
"
```
Before the merge step, save the old graph: `cp graphify-out/graph.json .graphify_old.json`
Clean up after: `rm -f .graphify_old.json`
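Conceptually, the diff reduces to set operations over node ids and edge endpoint pairs. A hypothetical sketch of that core (graphify's actual `graph_diff` returns richer records, e.g. node labels and a `summary` string):

```python
def diff_sketch(old, new):
    """Set-based diff of two node-link dicts (illustrative, not graphify's API)."""
    old_ids = {n["id"] for n in old["nodes"]}
    new_ids = {n["id"] for n in new["nodes"]}
    old_edges = {(e["source"], e["target"]) for e in old["links"]}
    new_edges = {(e["source"], e["target"]) for e in new["links"]}
    return {
        "new_nodes": sorted(new_ids - old_ids),
        "removed_nodes": sorted(old_ids - new_ids),
        "new_edges": sorted(new_edges - old_edges),
    }

before = {"nodes": [{"id": "a"}, {"id": "b"}],
          "links": [{"source": "a", "target": "b"}]}
after = {"nodes": [{"id": "a"}, {"id": "b"}, {"id": "c"}],
         "links": [{"source": "a", "target": "b"}, {"source": "b", "target": "c"}]}
print(diff_sketch(before, after))
# {'new_nodes': ['c'], 'removed_nodes': [], 'new_edges': [('b', 'c')]}
```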
---
## For --cluster-only
Use the `graphify_cluster` extension tool.
---
## For query, path, explain, add, watch
Use the corresponding extension tools:
| Subcommand | Tool |
|------------|------|
| `query` | `graphify_query` |
| `path` | `graphify_path` |
| `explain` | `graphify_explain` |
| `add` | `graphify_add` |
| `watch` | `graphify_watch` |
---
## Honesty Rules
- Never invent an edge. If unsure, use AMBIGUOUS.
- Never skip the corpus check warning.
- Always show token cost in the report.
- Never hide cohesion scores behind symbols - show the raw number.
- Never run HTML viz on a graph with more than 5,000 nodes without warning the user.
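The last rule is easy to enforce mechanically. A minimal guard sketch over a parsed node-link dict (the 5,000-node threshold comes from the rule above; how the HTML exporter is then invoked is assumed to be handled separately):

```python
def viz_guard(graph_data, max_nodes=5000):
    """Return True if the graph is small enough for the HTML viz without a warning."""
    n = len(graph_data.get("nodes", []))
    if n > max_nodes:
        print(f"Warning: {n:,} nodes exceeds {max_nodes:,} - HTML viz may be slow or unresponsive.")
        return False
    return True

viz_guard({"nodes": [{"id": i} for i in range(6)]})  # small graph: True, no warning
```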